Microbial Genomics. 2023 Mar 23;9(3):mgen000961. doi: 10.1099/mgen.0.000961

ShigaPass: an in silico tool predicting Shigella serotypes from whole-genome sequencing assemblies

Iman Yassine 1,2, Elisabeth E Hansen 1, Sophie Lefèvre 1, Corinne Ruckly 1, Isabelle Carle 1, Monique Lejay-Collin 1, Laetitia Fabre 1, Rayane Rafei 2, Maria Pardos de la Gandara 1, Fouad Daboussi 2, Ahmad Shahin 2,3, François-Xavier Weill 1,*
PMCID: PMC10132075  PMID: 36951906

Abstract

Shigella is one of the commonest causes of diarrhoea worldwide and a major public health problem. Shigella serotyping is based on a standardized scheme that splits Shigella strains into four serogroups and 60 serotypes on the basis of biochemical tests and O-antigen structures. This conventional serotyping method is laborious, time-consuming, impossible to automate, and requires a high level of expertise. Whole-genome sequencing (WGS) is becoming more affordable and is now used for routine surveillance, opening up possibilities for the development of much-needed accurate rapid typing methods. Here, we describe ShigaPass, a new in silico tool for predicting Shigella serotypes from WGS assemblies on the basis of rfb gene cluster DNA sequences, phage and plasmid-encoded O-antigen modification genes, seven housekeeping genes (EnteroBase’s MLST scheme), fliC alleles and clustered regularly interspaced short palindromic repeats (CRISPR) spacers. Using 4879 genomes, including 4716 reference strains and clinical isolates of Shigella characterized with a panel of biochemical tests and serotyped by slide agglutination, we show here that ShigaPass outperforms all existing in silico tools, particularly for the identification of Shigella boydii and Shigella dysenteriae serotypes, with a correct serotype assignment rate of 98.5 % and a sensitivity rate (i.e. ability to make any prediction) of 100 %.

Keywords: Shigella, ShigaPass, in silico serotyping, whole-genome sequencing

Data Summary

All the genomes used in this study are publicly accessible. The short-read sequence data have been submitted to the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena/) under study accession numbers PRJEB44801, PRJEB2846 and PRJEB2128. Other whole-genome sequences analysed during the study are available from ENA, NCBI RefSeq (https://www.ncbi.nlm.nih.gov/refseq/), DDBJ (https://www.ddbj.nig.ac.jp/index-e.html) and GenBank (https://www.ncbi.nlm.nih.gov/genbank/). All the accession numbers of the genomes used in this study are listed in Supplementary Data 1 (available in the online version of this article). The genomes from the development or validation dataset can be selected via the ‘ShigaPass_DataSet’ field in Supplementary Data 1. ShigaPass and all the databases used (rfb, fliC, CRISPR and MLST) are publicly available from GitHub (https://github.com/imanyass/ShigaPass). The authors confirm that all the supporting data, source codes and protocols are provided within the article or in supplementary data files.

Impact Statement

Shigella is a specialized lineage of Escherichia coli consisting of four serogroups and 60 serotypes. The enteric infections caused by human-pathogenic Shigella are increasingly resistant to antibiotics and require continual close monitoring. The conventional serotyping method has several limitations and may yield unreliable results. Public health laboratories are, therefore, increasingly shifting to whole-genome sequencing (WGS) for bacterial characterization. Here, we describe ShigaPass, a new in silico tool for predicting Shigella serotypes from complete and draft assemblies. Unlike other existing in silico tools, ShigaPass uses various genetic structures to predict Shigella serotype and to differentiate between Shigella, entero-invasive Escherichia coli (EIEC) and non-Shigella/EIEC isolates. This tool was validated on 4268 serotyped and sequenced clinical isolates of Shigella and compared with ShigaTyper and ShigEiFinder. ShigaPass version 1.5.0 outperformed these existing in silico tools, yielding the best concordance with the current Shigella serotyping scheme (98.5 %) with a sensitivity (i.e. ability to make any prediction) of 100 %, ensuring robust backward compatibility. In particular, ShigaPass had a superior performance for the prediction of all Shigella boydii and Shigella dysenteriae serotypes, even the rare ones. This validation study shows that ShigaPass can be used with confidence by reference laboratories. ShigaPass has been used for the routine genomic surveillance of shigellosis in France since October 2021. A standalone version of ShigaPass is available from GitHub and an online version is available from Galaxy.

Introduction

Shigella is a rod-shaped Gram-negative bacterium from the family Enterobacteriaceae. This human pathogen is highly virulent, causing infections with a very low infectious dose transmitted by the faecal–oral route. The severity of the illnesses it causes ranges from mild, self-limited diarrhoea to severe dysentery, killing thousands of people around the world each year [1, 2]. Shigella is endemic in many low- and middle-income countries and mostly affects children under the age of 5 years. In high-income countries, Shigella is frequently associated with travellers returning from endemic areas, men who have sex with men (MSM) and Orthodox Jewish communities [2–5].

The current classification, based on biochemical tests and O-antigen typing, splits the genus Shigella into four serogroups (originally considered to be species) – Shigella dysenteriae (serogroup A), S. flexneri (serogroup B), S. boydii (serogroup C) and S. sonnei (serogroup D) – and 60 serotypes [6–8]. However, modern population genetics methods based on bacterial DNA sequences have shown that Shigella can be grouped into eight phylogenetically distinct clusters within the species Escherichia coli [7, 9–11]. Despite the high level of genetic relatedness between E. coli and Shigella, they are maintained as separate entities for epidemiological and clinical reasons. Within E. coli, Shigella and entero-invasive E. coli (EIEC) are considered to form a specific pathovar, as both can cause dysentery and they have several characteristics in common, including the presence of a virulence plasmid (pINV) conferring an ability to invade host epithelial cells. The surveillance of Shigella and differentiation between Shigella and E. coli are, therefore, crucial [12–14].

The methods for classifying Shigella developed to date have been laborious or lacking in discrimination. Conventional Shigella typing is based on both biochemical and serological assays. Biochemical tests are used to differentiate between Shigella and E. coli and to determine serogroup. The serotype is then identified by slide agglutination with a large panel of antisera raised against lipopolysaccharide O antigens in rabbits. This approach has been used for laboratory surveillance since 1949 [6, 15]. However, this conventional Shigella typing method is a laborious, time-consuming procedure requiring considerable resources and a high level of expertise. In addition, intra- and interspecies cross-reactivity, the existence of rough strains that do not produce O antigens, and a lack of some typing sera, particularly for newly identified Shigella serotypes, make interpretation difficult, potentially leading to unreliable and erroneous results [16, 17].

Various, predominantly PCR-based, molecular methods targeting the H-antigen fliC gene and the O-antigen rfb gene cluster have been proposed to overcome these difficulties [18–21]. Shigella has long been considered non-flagellated, as its flagellin (fliC) gene is cryptic. Using fliC-RFLP, Coimbra and co-workers identified 17 different fliC restriction patterns, each corresponding to one or several Shigella serotypes [18]. The rfb-RFLP method, which involves amplifying the rfb locus by long-range PCR [22], yields different band patterns for most of the O-antigens, but cannot differentiate between Shigella and E. coli with similar or identical O-antigens [19, 22]. This method is also unable to differentiate between the S. flexneri 1–5, X and Y subserotypes [19]. All S. flexneri serotypes other than S. flexneri serotype 6 have the same O-antigen backbone structure, but differ in terms of glucosylation, O-acetylation and/or phosphoethanolamine (PEtN) modifications conferred by bacteriophage genes integrated into the host chromosome (gtrI, gtrIc, gtrII, gtrIV, gtrV, gtrX, oac and oac1b genes) or performed by enzymes encoded by plasmid genes (opt genes) [2, 20, 21, 23–26]. These genes are collectively referred to here as phage- and plasmid-encoded O-antigen modification (POAC) genes. Several multiplex PCRs targeting these POAC genes have been developed to identify S. flexneri subserotypes, but no such PCR has yet been able to identify all Shigella serotypes [2, 20, 21, 27, 28].

With reports of the emergence of multidrug-resistant or even extensively drug-resistant Shigella strains in many countries in recent years and the current dearth of effective treatments, improvements in the laboratory surveillance of Shigella are essential [3, 29, 30]. Whole-genome sequencing (WGS) is becoming increasingly affordable and is now widely used for routine surveillance. The molecular methods described above have, therefore, fallen into decline. We recently demonstrated that the EnteroBase E. coli/Shigella core-genome multilocus sequence typing (cgMLST) scheme using hierarchical cluster (HC) analysis (cgMLST V1+HierCC V1) at different levels of resolution (HC2000 to HC400) is a robust and portable method for the laboratory surveillance of Shigella infections and for monitoring the trends in globally circulating Shigella types. However, we have recommended the use of cgMLST coupled with in silico serotyping to maintain backward compatibility with the current Shigella serotyping scheme, thereby avoiding problems of non-comparability with the serotyping data amassed worldwide over a period of more than 70 years [7]. Two in silico tools, ShigaTyper and ShigEiFinder, have been developed to date [11, 31]. They can predict Shigella serotypes based on the wzx/wzy genes of the rfb gene cluster and can differentiate between E. coli and Shigella based on certain genetic markers, but we have shown that optimization of these tools is required, particularly in terms of the choice of molecular targets [7].

We used several genetic structures – the rfb gene cluster, POAC genes, seven housekeeping (HK) genes (adk, fumC, gyrB, icd, mdh, recA and purA), the fliC gene and clustered regularly interspaced short palindromic repeats (CRISPR) spacers – to develop ShigaPass, a new in silico tool for predicting Shigella serotypes from WGS assemblies. We developed and validated this new tool based on more than 4700 Shigella genome sequences from reference strains and clinical isolates that had already been characterized with the conventional serotyping scheme (Supplementary Data 1). We show here that ShigaPass outperforms all existing in silico tools, with a correct assignment rate of 98.5 %.

Methods

Strain selection and typing

In total, we studied 4716 Shigella reference strains and clinical isolates distributed in two datasets: a development dataset and a validation dataset (Supplementary Data 1).

The development dataset was used to develop ShigaPass, a tool predicting Shigella serotypes from various genetic structures: the rfb gene cluster; POAC genes; seven HK genes (from the EnteroBase E. coli/Shigella MLST scheme; https://enterobase.warwick.ac.uk/species/index/ecoli [32, 33]); the fliC gene; and CRISPR spacers. This dataset contained data for 440 Shigella reference strains and clinical isolates selected from the French National Reference Centre for E. coli, Shigella and Salmonella (FNRC-ESS) collection, Institut Pasteur, Paris, and four S. flexneri 5b strains retrieved from EnteroBase [33], to ensure that the strains and isolates studied were representative of the Shigella population, covering all four serogroups and 59 of the 60 serotypes (the serotype not covered being S. flexneri 1d) (Table S1). The strains from the development dataset were sequenced on various Illumina sequencing platforms (HiSeq, NovaSeq 6000, NextSeq 500) (Supplementary Data 1). In addition, data for a collection of 45 EIEC strains representative of the various previously described EIEC clusters [10, 34–36], 70 E. coli strains including 68 E. coli strains from the E. coli reference (ECOR) collection representative of the diversity of E. coli [7, 37], and 20 non-Shigella/E. coli strains were included; these sequences were selected from the FNRC-ESS or downloaded from NCBI and used as a negative control for the development dataset (Table S1).

The validation dataset consisted of 4276 clinical isolates of Shigella used to validate ShigaPass. Between January 2017 and September 2021 (when phenotypic typing was stopped), 4489 clinical Shigella isolates were received, serotyped and sequenced (NextSeq 500) at the FNRC-ESS, which is responsible for the microbiological surveillance of Shigella in France through a voluntary laboratory-based network of ~1000 clinical laboratories located in mainland France and its overseas territories. We included 4276/4489 (95.2 %) of these isolates in the validation dataset. These 4276 isolates were those for which the genomic sequences obtained had passed the quality control criteria of EnteroBase (sequence length: 3.7–6.4 Mb, number of contigs: ≤800, N50: >20 kb, proportion of N’s: <3 %, species assignment according to Kraken: >70 % contigs assigned) [33], and which had not already been used in the development dataset. A collection of non-Shigella isolates (n=28) was also included in the validation dataset, as a negative control (Table S1). All the reference strains and clinical isolates originating from the FNRC-ESS were characterized by sero-agglutination and a panel of biochemical tests, according to the standard protocols previously described [38]. These sero-agglutination serotyping results are considered the reference results to which in silico serotyping results are compared. However, when a discordance between the phenotypic and in silico methods was detected, conventional serotyping was repeated at least once.

DNA extraction and sequencing

Genomic DNA was extracted with the Wizard Genomic DNA Kit (Promega), the Maxwell 16 cell DNA purification kit (Promega) or the MagNA Pure DNA isolation kit (Roche Molecular Systems), in accordance with the manufacturers’ instructions. WGS was performed with various Illumina platforms for the strains and isolates of the development dataset (HiSeq, NovaSeq 6000, NextSeq 500). For the validation dataset, WGS was performed as part of routine procedures at the FNRC-ESS, and at the Mutualized Platform for Microbiology (P2M) at Institut Pasteur, Paris. The libraries were prepared with the Nextera XT kit (Illumina) and sequencing was performed with the NextSeq 500 system (Supplementary Data 1). All reads were filtered with FqCleanER version 3.0 (https://gitlab.pasteur.fr/GIPhy/fqCleanER) with options -q 15 -l 50 to eliminate adaptor sequences and low-quality reads with Phred scores below 15 and a length of less than 50 bp [39]. Assemblies were generated with SPAdes version 3.15, with the following options: -k 21,33,55,77 --only-assembler --careful --cov-cutoff auto [40]. Short-read sequence data were submitted to the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena/) (Supplementary Data 1) and were also uploaded onto EnteroBase. EnteroBase SPAdes assemblies were downloaded to evaluate the effect of assembly methods on the performance of ShigaPass. The genomic sequences from 4160 Shigella strains and isolates had already been published (Supplementary Data 1), and 541 new genomic sequences are presented here (Supplementary Data 1) [2, 7, 41, 42].

ShigaPass databases

ShigaPass predicts Shigella serotypes from WGS assemblies based on the variability of rfb gene cluster DNA sequences, POAC genes, MLST sequence type (ST), fliC alleles and CRISPR spacers. The complete and draft assemblies of the genomes studied are queried against these different databases with blast+ (blastn) version 2.12 [43]. The resulting hits are filtered to match the thresholds selected for sequence identity and coverage (Table S2), and the best-match hits are identified for each marker.
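The hit-filtering step described above can be illustrated with a minimal Python sketch. It is not the ShigaPass implementation: the `Hit` record, the `best_hits` function and the threshold values are hypothetical, and the real cut-offs are those given in Table S2, which is not reproduced here. The sketch assumes that each blastn hit has already been parsed into a marker name, a matched database entry, a percentage identity, a percentage coverage and a bit score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    marker: str      # marker family, e.g. "fliC" or "CRISPR"
    target: str      # database entry matched by the hit
    identity: float  # percentage identity of the alignment
    coverage: float  # percentage of the database entry covered
    bitscore: float  # blast bit score, used here to rank hits


# Hypothetical (identity, coverage) thresholds; the real values are in Table S2.
THRESHOLDS = {"fliC": (98.0, 95.0), "CRISPR": (100.0, 100.0)}


def best_hits(hits, thresholds=THRESHOLDS):
    """Keep hits passing both thresholds and return the top hit per marker."""
    best = {}
    for hit in hits:
        min_identity, min_coverage = thresholds[hit.marker]
        if hit.identity < min_identity or hit.coverage < min_coverage:
            continue  # discard hits below either cut-off
        current = best.get(hit.marker)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.marker] = hit
    return best
```

Ranking by bit score is an assumption made for the sketch; any consistent ordering of the hits that pass the filter would serve the same illustrative purpose.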

The rfb database contained the k-mer sequences of 42 of the 44 previously described Shigella O-antigen (rfb) gene clusters (Table S3) [7]. These rfb sequences were trimmed into 150 bp long k-mers with a sliding window of 50 bp, to facilitate detection with an automated blast pipeline. The mean number of k-mers per rfb was 170, with the number of k-mers ranging from 83 to 294. The rfb k-mers used in ShigaPass are publicly available from GitHub (https://github.com/imanyass/ShigaPass/tree/main/SCRIPT/ShigaPass_DataBases/RFB).
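The trimming of a sequence into overlapping 150 bp k-mers with a 50 bp step can be sketched in a few lines of Python. The function name `sliding_kmers` is hypothetical, and the sketch assumes that a trailing window shorter than 150 bp is dropped; the text does not state how the end of each rfb sequence was handled.

```python
def sliding_kmers(sequence, k=150, step=50):
    """Return all full-length k-mers taken every `step` bases along `sequence`."""
    # Start positions stop at the last index from which a complete k-mer fits.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


# Example with a synthetic 300 bp sequence: windows start at 0, 50, 100 and 150.
example = "ACGT" * 75
kmers = sliding_kmers(example)
print(len(kmers), len(kmers[0]))
```

Under these assumptions, a 9 kb cluster would give 178 k-mers before any filtering, which is of the same order as the mean of 170 k-mers per rfb reported above.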

The same k-mer strategy was used to construct the POAC database containing the sequences of the previously published POAC genes (gtrI, gtrIc, gtrII, gtrIV, gtrV, gtrX, oac, oac1b and optII) found in S. flexneri serotypes 1–5, X and Y (Table S4) [2, 20, 21, 23–26].

The POAC k-mers used in ShigaPass are publicly available from: https://github.com/imanyass/ShigaPass/blob/main/SCRIPT/ShigaPass_DataBases/RFB/POAC-genes_150-mers.fasta.

The MLST database for seven E. coli/Shigella HK genes (adk, fumC, gyrB, icd, mdh, recA and purA) was downloaded from EnteroBase (June 2020) (https://enterobase.warwick.ac.uk/species/index/ecoli) [33] and MLST profiles were determined by ShigaPass according to Achtman’s E. coli/Shigella MLST typing scheme [32].

The fliC database consisted of 59 fliC sequences. In silico and conventional PCR were performed to extract the fliC gene from the genomes of the development dataset (Supplementary Material section ‘Characterization of fliC alleles and fliC phylogeny’) [44]. The fliC sequences used in ShigaPass are publicly available from: https://github.com/imanyass/ShigaPass/blob/main/SCRIPT/ShigaPass_DataBases/FLIC/.

The CRISPR database consisted of 22 spacer sequences obtained by in silico PCR targeting the CRISPR locus located downstream from the iap gene (iap CRISPR) in the genomes of the development dataset (Supplementary Material section ‘Determination of CRISPR profiles’) [45, 46]. The sequences of the spacers used in ShigaPass are publicly available from: https://github.com/imanyass/ShigaPass/tree/main/SCRIPT/ShigaPass_DataBases/CRISPR.

Detection of the ipaH gene

The invasion plasmid antigen H (ipaH) gene is a multicopy gene found in both the chromosome and plasmid of all Shigella and EIEC strains. The presence of sequences encoding the highly conserved C-terminal domain of IpaH proteins was used as an indicator of Shigella/EIEC (Table S4) [11, 31, 47–49]. Strains with no ipaH hits are considered ‘Not Shigella/EIEC’ (Fig. 1). As this gene was frequently split between multiple contigs, we applied a k-merization approach with the same options as for rfb and POAC genes. The k-mers used are publicly available from: https://github.com/imanyass/ShigaPass/tree/main/SCRIPT/ShigaPass_DataBases/IPAH.

Fig. 1.

Summary of the ShigaPass pipeline. (a) ShigaPass accepts draft or complete genome sequences. ShigaPass excludes non-Shigella/EIEC genomes on the basis of detection of the ipaH gene. blast-hit selection and sorting cut-offs are described in Table S2. The results are then compared to ShigaPass meta-profiles, a list of meta-profiles combining the four targets defined for each serotype. ShigaPass predicts the serotype whenever it finds a match between rfb cluster profile and at least two of the three remaining markers within the defined ShigaPass meta-profiles. A result of ‘Shigella spp.’ can indicate genome contamination, poor quality or new emerging serotypes. ShigaPass is available as a standalone version on GitHub (https://github.com/imanyass/ShigaPass/) and as an online version via Galaxy (https://galaxy.pasteur.fr/). rfb, fliC, CRISPR spacers and MLST ST databases can be accessed via the RFB, FLIC, CRISPR and MLST repositories, respectively, in GitHub. (b) Summary of the rfb decision tree and actions taken to differentiate between serotypes possessing identical or highly similar rfb sequences. SF1–5, X and Y subserotypes are identified on the basis of POAC k-mer detection. SB1/SB20 and SB6/SB10 are distinguished on the basis of detection of the galF and tauA genes, respectively. SD3/SD16 and SD2/SDprov.BEDP02-5104 are distinguished on the basis of detection of the additional serotype-specific rfb. SB, S. boydii; SD, S. dysenteriae; SF, S. flexneri; SS, S. sonnei; SDprov., S. dysenteriae provisional BEDP 02–5104; FlexSerotype, S. flexneri 1–5, X and Y subserotypes.

All the genes used in this study are described in Table S4.
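The meta-profile rule summarized in Fig. 1 (a match on the rfb profile plus agreement for at least two of the three remaining markers) can be sketched as follows. This is an illustrative reading of the rule rather than the ShigaPass code: the function name, the dictionary layout and all marker values are hypothetical, and the sketch assumes a single expected value per marker, whereas the actual meta-profiles may allow several.

```python
OTHER_MARKERS = ("mlst", "fliC", "crispr")


def predict_serotype(observed, meta_profiles):
    """Return the first serotype whose meta-profile is matched, else 'Shigella spp.'.

    observed      -- dict with the best call for "rfb", "mlst", "fliC" and "crispr"
    meta_profiles -- dict mapping serotype name to a dict with the same four keys
    """
    for serotype, profile in meta_profiles.items():
        if observed.get("rfb") != profile["rfb"]:
            continue  # the rfb profile must match
        agreeing = sum(observed.get(m) == profile[m] for m in OTHER_MARKERS)
        if agreeing >= 2:  # at least two of the three remaining markers
            return serotype
    return "Shigella spp."


# Hypothetical meta-profile and genome calls, for illustration only.
profiles = {
    "serotype_A": {"rfb": "rfb_A", "mlst": "ST_1", "fliC": "fliC_1", "crispr": "profile_1"},
}
genome = {"rfb": "rfb_A", "mlst": "ST_1", "fliC": "fliC_1", "crispr": "profile_9"}
print(predict_serotype(genome, profiles))
```

Requiring only two of the three secondary markers lets a prediction survive one missing or divergent marker, for example an ST that could not be called because of a missing allele.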

Evaluation of ShigaPass

We compared the performance of ShigaPass version 1.5.0 (https://github.com/imanyass/ShigaPass) with that of existing in silico serotyping tools: ShigaTyper version 1.0.6 (using short reads) (https://github.com/CFSAN-Biostatistics/shigatyper) [31], ShigEiFinder version 1.2.0 and its latest version 1.3.2 uploaded onto GitHub on 21 April 2022 (using both short reads and assemblies) (https://github.com/LanLab/ShigEiFinder) [11]. The following categories were used in our comparison of the different tools: correct, uncertain, incorrect and none. A correct assignment means that the predicted serotype was the same as the phenotypically determined serotype. An uncertain assignment means that more than one serotype was predicted and that the correct serotype was listed among them or, for ShigEiFinder, that only the cluster level (Shigella clusters Cs1–Cs3, and CSS as defined by Zhang and co-workers [11]) was obtained. An incorrect assignment means that the predicted serotype was different from the phenotypically determined serotype. ‘None’ means that no serotype was predicted. The correct assignment rate was determined as the number of strains correctly assigned by the in silico tool/total number of strains. The capacity of an in silico tool to report any result (i.e. Shigella serotype, Shigella cluster, Shigella spp. or even not Shigella) regardless of the accuracy or outcome of the prediction is described here as sensitivity and defined as (total number of strains – strains with no prediction)/total number of strains.
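The two performance measures defined above can be written out as a short Python sketch. The function name `evaluate` and the input format (one category label per strain) are hypothetical conveniences; the counts in the example are invented for illustration and are not results from this study.

```python
from collections import Counter


def evaluate(categories):
    """Compute the correct assignment rate and the sensitivity.

    categories -- one label per strain: "correct", "uncertain", "incorrect" or "none"
    """
    total = len(categories)
    if total == 0:
        raise ValueError("at least one strain is required")
    counts = Counter(categories)
    return {
        # strains correctly assigned / total number of strains
        "correct_rate": counts["correct"] / total,
        # (total number of strains - strains with no prediction) / total
        "sensitivity": (total - counts["none"]) / total,
    }


# Invented example: 8 correct, 1 uncertain and 1 strain without prediction.
print(evaluate(["correct"] * 8 + ["uncertain", "none"]))
```

Note that sensitivity, as defined here, counts uncertain and incorrect assignments as predictions, so it measures only whether the tool returns a result, not whether that result is right.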

Results

Design of ShigaPass: the databases

We made use of the variability of different genetic structures to develop ShigaPass, an in silico tool for predicting Shigella serotypes from assembled genomes.

Our first target was the rfb gene cluster, which encodes the Shigella serotype-specific lipopolysaccharide O antigen. Forty-four different O-antigen rfb gene clusters, ranging in size from 9 to 17 kb, have been described in Shigella (Table S3) [7]. The E. coli and Shigella rfb sequences have a low GC content and are not well covered in strains sequenced with the Illumina Nextera XT kit for library preparation, as the transposase-based library is biased against AT-rich sequences [7, 50]. Genome assemblies may, therefore, generate incomplete rfb data. Consequently, we trimmed the rfb into k-mers of 150 bp in length with a 50 bp sliding window, to facilitate detection with a blast pipeline. The k-mers including IS elements or displaying >90 % coverage and identity between different serotypes were removed. Each k-mer in the database was generally specific for one serotype, as non-specific k-mers were discarded, with some exceptions (see below).

Several serotypes have identical or highly similar rfb sequences. These serotypes include S. boydii 1 and S. boydii 20, S. boydii 6 and S. boydii 10, S. dysenteriae 2 and S. dysenteriae prov. BEDP 02–5104, and S. dysenteriae 3 and S. dysenteriae 16 [7]. S. boydii 1 and S. boydii 20 have the same rfb. However, that of S. boydii 20 is unusual in that it begins with an insertion sequence (IS), the galF gene preceding the rfb cluster having been deleted [7]. We therefore used the galF gene as a marker to differentiate between S. boydii 1 and S. boydii 20 (https://github.com/imanyass/ShigaPass/blob/main/SCRIPT/ShigaPass_DataBases/RFB/galF_SB1.fasta). A comparison between the S. boydii 6 and S. boydii 10 genomes showed that the tauA gene, which encodes a taurine-binding periplasmic protein, was present only in S. boydii 6 genomes. This gene was, therefore, used as a marker of S. boydii 6 (https://github.com/imanyass/ShigaPass/blob/main/SCRIPT/ShigaPass_DataBases/RFB/taurine_SB6.fasta). S. dysenteriae prov. BEDP 02–5104 carries two different rfb clusters, one of which is chromosomal and identical to that of S. dysenteriae 2, whereas the other is plasmid-borne [7]. S. dysenteriae 16 also has two rfb regions: a variable 6 kb remnant rfb (identical to that of S. dysenteriae 3) in the normal location of rfb (i.e. located next to the colanic acid biosynthesis gene cluster [7]), and a second complete rfb located elsewhere on a genomic island [7]. These additional rfb regions were used as indicators for S. dysenteriae prov. BEDP 02–5104 and S. dysenteriae 16, respectively. Some S. dysenteriae 16 strains may not retain the genomic island containing the complete and specific rfb. We therefore decided to not exclude the k-mers targeting the partial rfb (which is common to S. dysenteriae 3) [7], as the detection of these common k-mers could be used as a marker for S. dysenteriae 16 or for S. dysenteriae 3 (in cases in which the specific S. dysenteriae 3 k-mers are not detected due to poor coverage of this region). However, in such cases, the genome is typed as ‘S. dysenteriae 3/S. dysenteriae 16’.
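The disambiguation logic described in this paragraph can be summarized as a small decision function. This is a sketch of one possible reading of the text, not the ShigaPass code: the function name, the pair labels and the marker names are hypothetical, and calling the second serotype of a pair when the distinguishing marker is absent is an assumption.

```python
def resolve_shared_rfb(pair, markers):
    """Resolve a pair of serotypes sharing an rfb, given the set of detected markers."""
    if pair == "SB1/SB20":
        # galF precedes the rfb cluster in S. boydii 1 and is deleted in S. boydii 20.
        return "SB1" if "galF" in markers else "SB20"
    if pair == "SB6/SB10":
        # tauA was found only in S. boydii 6 genomes.
        return "SB6" if "tauA" in markers else "SB10"
    if pair == "SD2/SDprov":
        # The provisional serotype carries an additional plasmid-borne rfb.
        return "SDprov" if "rfb_SDprov_plasmid" in markers else "SD2"
    if pair == "SD3/SD16":
        if "rfb_SD16_specific" in markers:
            return "SD16"
        if "rfb_SD3_specific" in markers:
            return "SD3"
        # Only the k-mers common to both serotypes were detected.
        return "SD3/SD16"
    raise ValueError(f"unknown pair: {pair}")


print(resolve_shared_rfb("SD3/SD16", {"rfb_common"}))
```

The combined ‘SD3/SD16’ result mirrors the case described above, in which neither serotype-specific region is detected and the genome cannot be assigned to a single serotype.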

Another exception is the presence in S. sonnei of a plasmid-borne rfb, on the virulence plasmid, pINV, which seems to be frequently lost during subculture, and a chromosomal remnant rfb (~2 kb). This remnant was not sufficiently discriminatory if not covered in its entirety, because, even though the entire remnant sequence is found only in S. sonnei strains, parts of this sequence are present in non-S. sonnei strains. There is, therefore, ultimately no portion unique to S. sonnei. The k-mers common to other serotypes were not, therefore, removed. We overcame this problem by maintaining the S. sonnei rfb k-mers in a separate database (https://github.com/imanyass/ShigaPass/blob/main/SCRIPT/ShigaPass_DataBases/RFB/RFB_serogroup_D_150-mers.fasta) and subjecting them to blast search only if no other rfb was detected.
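The two-stage search order described here, in which the S. sonnei database is queried only when no other rfb is detected, can be sketched as follows. The function and parameter names are hypothetical; each search is passed in as a callable returning an rfb call or `None`, so that the second search is run only when needed.

```python
def detect_rfb(search_main_database, search_sonnei_database):
    """Return the rfb call, querying the S. sonnei database only as a fallback."""
    call = search_main_database()
    if call is not None:
        return call  # an rfb from the main database takes precedence
    # No other rfb detected: only now are the S. sonnei k-mers searched.
    return search_sonnei_database()


# Illustration with stand-in search functions.
print(detect_rfb(lambda: None, lambda: "SS"))
```

Ordering the searches in this way prevents the non-unique S. sonnei k-mers from competing with the serotype-specific k-mers of the main database.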

The S. flexneri strains from the S3 cluster (serotypes 1–5, X, Y) have the same O-antigen backbone structure. Their serotypes result from modifications to the O-antigen tetrasaccharide repeat conferred by the gtr, oac and opt POAC genes [2]. Consequently, the detection of these POAC genes can predict S. flexneri 1–5, X and Y serotypes and subserotypes (Table S5) [20, 27, 28]. As these genes have a low GC content, we used the same k-mer strategy as for rfb to construct the database of POAC genes (Table S4). When ShigaPass detects a genome carrying the rfb of S. flexneri 1–5, X and Y, this genome is queried against the k-mer database containing the sequences of the POAC genes. The S. flexneri serotype is then determined on the basis of the correspondence between S. flexneri 1–5, X and Y serotypes and the POAC genes detected (Table S5).

Our second target was the seven HK genes of the Achtman MLST scheme. Using the Shigella strains in the development dataset (n=440/575), we identified 67 STs for 433 strains, as shown in Table 1. The ST could not be determined for seven strains due to missing data for one of the MLST alleles (fumC was not called in six strains and adk in one strain). Most of the STs had been reported before, but some were new. Given that STs are determined by seven genes, this variation is unsurprising, because mutations of any one of these genes can give rise to a new ST. We found that 44 of the 67 STs identified belonged to five ST complexes (the ST147, 148, 152, 243 and 245 complexes) found in the S2, S1, SON, S1 and S3 Shigella clusters (clusters previously defined by Yassine et al. [7]), respectively. The Shigella S2 cluster contained most of the STs that did not belong to ST complexes.
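The way in which an ST follows from the seven allele calls, and why a single missing allele prevents its determination, can be illustrated with a minimal sketch. The function name, the allele numbers and the profile table are hypothetical placeholders and do not correspond to actual EnteroBase profiles.

```python
GENES = ("adk", "fumC", "gyrB", "icd", "mdh", "recA", "purA")


def assign_st(alleles, profiles):
    """Return the ST for a seven-allele profile.

    alleles  -- dict mapping gene name to allele number (None if not called)
    profiles -- dict mapping a 7-tuple of allele numbers to an ST label
    """
    if any(alleles.get(gene) is None for gene in GENES):
        return None  # one missing allele is enough to leave the ST undetermined
    key = tuple(alleles[gene] for gene in GENES)
    # A complete profile absent from the table is reported as new.
    return profiles.get(key, "new")


# Hypothetical profile table and allele calls.
table = {(1, 2, 3, 4, 5, 6, 7): "ST_example"}
calls = {"adk": 1, "fumC": None, "gyrB": 3, "icd": 4, "mdh": 5, "recA": 6, "purA": 7}
print(assign_st(calls, table))
```

Because the ST is a lookup on the full combination of seven alleles, a change in any single allele produces a different key, which is consistent with the observation above that new STs arise readily.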

Table 1.

Multilocus sequence types (STs) identified for Shigella strains in the development dataset

Serotype | N | ST (n)
S. boydii | 86 |
  1 | 5 | 243 (4), 1746 (1)
  2 | 6 | 145 (5), 7385 (1)
  3 | 3 | 243 (2), 1743 (1)
  4 | 7 | 145 (5), 1130 (2)
  5 | 4 | 149 (4)
  6 | 1 | 243 (1)
  7 | 3 | 1749 (3)
  8 | 3 | 243 (3)
  9 | 4 | 257 (2), 1751 (1), 1752 (1)
  10 | 6 | 243 (6)
  11 | 7 | 1765 (3), 5475 (2), -1151 (1)
  12 | 2 | 1767 (1), 1768 (1)
  14 | 8 | 1273 (8)
  15 | 2 | 250 (1), 1766 (1)
  16 | 2 | 1750 (2)
  17 | 1 | 5619 (1)
  18 | 3 | 243 (3)
  19 | 5 | 243 (5)
  20 | 8 | 243 (7), 7375 (1)
  21 | 3 | 243 (3)
  22 | 3 | 1753 (2), 1769 (1)
S. dysenteriae | 90 |
  1 | 16 | 146 (12), 260 (3), 1014 (1)
  2 | 9 | 147 (3), 273 (3), 288 (1), 1747 (1), 1759 (1)
  3 | 6 | 148 (4), 145 (1), 7364 (1)
  4 | 4 | 252 (3), 7371 (1)
  5 | 3 | 243 (1), 1770 (1), -1148 (1)
  6 | 2 | 252 (2)
  7 | 3 | 243 (2), 1742 (1)
  8 | 3 | 289 (2), 7363 (1)
  9 | 3 | 252 (2), 1741 (1)
  10 | 2 | 290 (2)
  11 | 2 | 253 (1), 258 (1)
  12 | 6 | 148 (5), 253 (1)
  13 | 2 | 261 (1), -121 (1)
  14 | 1 | 148 (1)
  15 | 1 | 1739 (1)
  16 | 8 | 148 (8)
  17 | 6 | 252 (6)
  18 | 1 | 273 (1)
  prov. BEDP 02–5104 | 12 | 288 (7), 1761 (4), 1760 (1)
S. sonnei | 103 | 152 (82), 1502 (16), 1504 (1), 5517 (1), 6035 (1), -11 (1), -1147 (1)
S. flexneri 1–5, X, Y | 139 |
  PG1 | 41 |
    1a | 7 | 245 (5), 1024 (1), 6443 (1)
    1b | 13 | 245 (11), 7384 (2)
    1c/7a | 6 | 245 (6)
    7b | 2 | 245 (2)
    2b | 3 | 245 (2), 5672 (1)
    3b | 2 | 7384 (2)
    4a | 1 | 8878 (1)
    4av | 2 | 245 (1), 1022 (1)
    4b | 3 | 245 (2), 1022 (1)
    X | 1 | 245 (1)
    Y | 1 | 245 (1)
  PG2 | 29 |
    3a | 20 | 245 (16), 1025 (3), 5312 (1)
    3b | 9 | 245 (9)
  PG3 | 46 |
    1a | 2 | 245 (2)
    2a | 27 | 245 (26), 10662 (1)
    2b | 2 | 245 (1), 5240 (1)
    5a | 2 | 245 (2)
    X | 1 | 245 (1)
    Xv | 6 | 245 (6)
    Y | 3 | 245 (2), 6200 (1)
    Yv | 3 | 245 (3)
  PG4 | 9 |
    3a | 5 | 628 (5)
    3b | 1 | 628 (1)
    4bv | 1 | 628 (1)
    X | 2 | 628 (2)
  PG5 | 6 |
    5a | 2 | 631 (2)
    5b | 4 | 631 (4)
  PG6 | 2 |
    Y | 2 | 628 (2)
  PG7 | 6 |
    4a | 1 | 630 (1)
    4av | 3 | 630 (3)
    Yv | 2 | 630 (2)
S. flexneri 6 | 22 |
  Boyd 88 | 15 | 145 (14), 1512 (1)
  Hertfordshire | 4 | 145 (4)
  Manchester | 2 | 145 (2)
  Newcastle | 1 | 145 (1)
TOTAL | 440 |

STs present in several serotypes are highlighted in bold.

PG, phylogenetic group described previously [2]; prov., provisional.

The third target studied was the fliC gene, encoding the H-antigen. In the Shigella development dataset, we identified 59 fliC alleles in total, ranging in length from 1575 to 2013 bp, excluding IS elements. The fliC alleles contained premature stop codons, IS elements or both, due to their cryptic nature. The results for fliC clustering were consistent with the Shigella population structure determined by cgMLST (Fig. S1) [7]. A single unique fliC allele was found to correspond to each of the following serotypes: S. dysenteriae 4 and 7, and S. boydii 5, 7, 14, 15, 16 and 17 (Fig. S1). No fliC gene was found in S. dysenteriae 2, 18 or provisional (prov.) BEDP 02–5104 strains, or in S. boydii 12 strains, a few S. sonnei strains or S. flexneri 6 strains from the S1e cluster. Attempts to extract a new fliC from the region between fliA and fliD in these strains were unsuccessful due to IS-mediated deletions in the fliC gene region. Some serotypes possessed multiple fliC alleles differing by a few single nucleotide polymorphisms (SNPs). This was the case for the fliC genes from the ShH1 complex specific for S. sonnei, and the ShH2 complex specific for the S3 cluster. Conversely, similar fliC alleles belonging to the ShH3 or ShH4 complexes were also found in Shigella strains of different serotypes. However, these fliC sequence complexes included some fliC variants specific to particular serotypes, such as ShH3 in S. boydii 11, ShH4 in S. boydii 5 and ShH5 in S. boydii 9 (Fig. S1). In some strains, the presence of certain IS elements within fliC may also be specific to certain serotypes, as for IS2 within ShH21 (S. dysenteriae 11), IS1 within ShH32 (S. dysenteriae 9) and IS2 within ShH33 (S. dysenteriae 3).

The fourth target was the CRISPR spacers, which are short sequences (22–38 bp) interspersed with short palindromic repeats (direct repeats or DRs) in CRISPR arrays. We identified 22 different spacers arranged in 27 CRISPR profiles in the Shigella development dataset. The number of spacers per profile ranged from one to seven. These CRISPR profiles were highly consistent with the Shigella population structure determined by cgMLST (Table 2) [7]. For example, the A-var2 and A-var3 spacers were found only in strains from the Shigella S1 and S3 clusters, respectively. No spacers were detected in the S. dysenteriae 14 (n=1) and S. dysenteriae 17 (n=16) strains, which lacked the entire CRISPR array and had only a remnant iap gene (ending after the first 240 nt for S. dysenteriae 17 and after the first 650 nt for S. dysenteriae 14), followed by IS elements. However, the analysis of a PacBio sequence from S. dysenteriae 14 ATCC 49346 (GenBank accession number CP026832) [51], which was not included in this study, revealed the presence of the A-var2 spacer. This finding highlights the possibility of IS-driven alterations at the iap CRISPR locus. These IS-modified CRISPR structures were encountered only in S. dysenteriae 14 and 17. We therefore designed specific k-mers recognizing DNA sequences encompassing the end of the iap gene and the first few nucleotides of the IS. One particular feature concerned spacer 16, which is normally located within the iap CRISPR locus of S. boydii 12. This spacer was also identified on a separate contig outside of the iap locus in various serotypes (Table 2). Examination of the region surrounding this spacer identified no DR, but we found that this sequence, 100 % identical to spacer 16, was actually located on the pINV plasmid.
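The junction k-mer idea described above for S. dysenteriae 14 and 17 can be illustrated with a minimal Python sketch. It is not ShigaPass code (ShigaPass queries its databases with BLASTn). The sequences `iap_end` and `is_start` are made-up placeholders rather than real iap or IS sequences, and `has_junction` is a hypothetical helper that uses exact string matching on both strands as a simplification.

```python
# Illustrative only: placeholder sequences, exact matching instead of BLASTn.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def has_junction(contigs, junction_kmer):
    """Return True if the junction k-mer occurs on either strand of any contig.

    `contigs` maps contig names to sequences.
    """
    kmer = junction_kmer.upper()
    rc = reverse_complement(kmer)
    return any(kmer in seq.upper() or rc in seq.upper() for seq in contigs.values())


iap_end = "ATGGCTTACG"    # placeholder: last bases of the truncated iap gene
is_start = "GGTAATGACT"   # placeholder: first bases of the inserted IS
junction = iap_end + is_start

# An assembly in which iap is immediately followed by the IS.
is_modified = {"contig_1": "TTTT" + junction + "CCCC", "contig_2": "ACGTACGTACGT"}
# An assembly in which other sequence separates the two segments.
unmodified = {"contig_1": "TTTT" + iap_end + "AAAAAAAAAA" + is_start}

print(has_junction(is_modified, junction))
print(has_junction(unmodified, junction))
```

Because the k-mer spans the boundary between the gene remnant and the IS, it is found only when the two segments are adjacent, which is what makes it specific to the IS-modified locus.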

Table 2.

CRISPR profiles in the Shigella strains of the development dataset

| Shigella clusters | CRISPR spacers | Serotypes |
|---|---|---|
| S1a | A-var2 | SD3, SD4, SD6, SD9, SD11, SD12, SD13, SD15, SD16 |
| | None (iapIS1-A14) | SD14 |
| | None (iapIS1-93-119) | SD17 |
| S1b | A-var2 | SD3, SB2, SB4, SB11, SB14, SF6 |
| S1c | A-var2 | SB1, SB3, SB6, SB8, SB10, SB18, SB19, SB20, SB21, SD5 |
| S1d | A-var2, 9 | SD7 |
| S1e | A-var2, 9 | SF6 |
| S2a | A-var1, 12, (16) | SB17 |
| S2b | A-var1, 12, 3, 5, 8, (16) | SB5 |
| | A-var1, 12, 3, 5, (16) | SB16 |
| S2c | A-var1, (16) | SB11 |
| S2d | A-var1, 5, 8, 11-var1; A-var1, 12, 3, 5, 11-var1; A-var1, 12, 3, 5, 8, 11-var1; A-var1, 12, 3, 3, 5, 8, 11-var1; A-var1, 5, 8, 11-var1, 11-var1 | SD2 |
| | A-var1, 12, 3, 5, 8, 11-var1 | SD15 |
| | A-var1, 12, 3, 5, 8 | SD18 |
| | A-var1, 12, 3, 5, 8, 8, 11-var1 | SD prov. BEDP 02–5104 |
| S2e | A-var1, 12, 3, 5, 8, 11-var1, (16) | SB9 |
| S2f | A-var1, 12, (16) | SB7 |
| S3 | A-var3, x, (16) | SB22, SF1-5, X, Y |
| SON | A-var0; A-var0, 7; A-var0, 10; A-var0, 27; A-var0, 7, 10; A-var0, 27, 7; A-var0, 27, 10; A-var0, 27, 7, 10 | SS |
| SD1 | A-var1, 6, 24, 21, (16) | SD1 |
| SD10 | A-var4, 18, 4, (16) | SD10 |
| SD8 | A-var0, 1 | SD8 |
| SB12 | A-var0, 6, 24, 21, 16, (16) | SB12 |

(16), spacer 16 is not located in the iap CRISPR locus; prov., provisional; SB, S. boydii ; SD, S. dysenteriae ; SF, S. flexneri ; SS, S. sonnei .

We found that MLST, CRISPR and fliC alone were not sufficiently discriminatory to replace traditional serotyping, as many serotypes share the same allele. The O-antigen rfb cluster is the only marker known to be highly specific for serotypes, but it cannot be used alone either, because many E. coli strains carry rfb clusters that are identical, or highly similar, to those of Shigella [7]. Thus, for serotype inference, rfb analysis should be coupled with analyses of additional genetic structures. At least two additional genetic structures are required because the MLST ST may be modified by a single mutation in one of the seven HK genes, and the CRISPR or fliC allele may be affected by IS mobilization. We therefore tested several associations of techniques (Table S6). We found that the identification of the different Shigella serotypes and differentiation between Shigella and non- Shigella were optimal for a combination of MLST, CRISPR, fliC and rfb analyses. The extraction of these four genetic structures from 440 Shigella strains of the development dataset revealed 137 distinct meta-profiles, which we defined as ShigaPass meta-profiles (Table S7) (https://github.com/imanyass/ShigaPass/blob/main/SCRIPT/ShigaPass_DataBases/ShigaPass_meta_profiles_v5.csv).

Design of ShigaPass: the pipeline

ShigaPass is a shell script designed to predict Shigella serotypes from assembled genomes. The workflows adopted by ShigaPass are outlined in Fig. 1. Assemblies are queried against the different databases with BLASTn. Detection of the ipaH gene, which is present as multiple copies on both the chromosome and pINV in Shigella and EIEC, is used as an exclusion checkpoint. Strains with no ipaH hits are considered ‘Not Shigella/EIEC’ and only ipaH-positive strains are subjected to ShigaPass prediction. ShigaPass then uses the reference databases to identify the rfb gene clusters, MLST STs, fliC alleles and CRISPR spacer sequences (Fig. 1). The Shigella serotype is determined by a match between the results obtained for the four genetic structures and one of the ShigaPass meta-profiles (https://github.com/imanyass/ShigaPass/blob/main/SCRIPT/ShigaPass_DataBases/ShigaPass_meta_profiles_v5.csv). ShigaPass can deal with new profile combinations (due to variations of fliC sequences, CRISPR profiles and ST alleles) or cases of missing data: it predicts the serotype whenever the rfb cluster profile and two of the other three genetic structures match a defined meta-profile (in such cases ShigaPass indicates a ‘75 % match’). When ShigaPass detects the specific rfb of S. flexneri of cluster 3, an analysis of POAC genes is performed to identify S. flexneri 1–5, X and Y and their subserotypes. If multiple rfb hits are detected, the predominant rfb, with the largest number of hits, is identified and a comment mentioning the number of rfb hits is displayed to alert the user. In cases of more than one leading rfb with equal numbers of hits (a situation never observed with either our development or validation set), the rfb is selected randomly. When ShigaPass cannot assign a Shigella serotype, it returns a ‘Shigella spp.’ or ‘EIEC’ result. ‘Shigella spp.’ is assigned if the script fails to recognize a matching profile, but does recognize a specific Shigella MLST. This assignment may be encountered if the genome is contaminated, of poor quality, or in cases of new emerging serotypes. ‘EIEC’ is assigned if ipaH is detected but no matching profile is identified. ShigaPass is freely available as a standalone version from GitHub (https://github.com/imanyass/ShigaPass) under GNU General Public License v3 (GPL3) and as an online version via Galaxy (https://galaxy.pasteur.fr/).
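The decision rules described above can be sketched in a few lines of Python. This is a simplified illustration written for this text, not the ShigaPass shell script: the meta-profiles, allele labels and the `predict` and `predominant_rfb` helpers are hypothetical, the POAC analysis for S. flexneri is omitted, and the set of Shigella-specific STs is passed in as an assumed input.

```python
import random
from collections import Counter

# Hypothetical meta-profiles; the real list is the ShigaPass meta-profile CSV.
META_PROFILES = [
    {"rfb": "rfb_A", "mlst": "ST_1", "flic": "fliC_1", "crispr": "CRISPR_1",
     "serotype": "serotype A"},
    {"rfb": "rfb_B", "mlst": "ST_2", "flic": "fliC_2", "crispr": "CRISPR_2",
     "serotype": "serotype B"},
]


def predominant_rfb(rfb_hits):
    """Return the rfb with the most hits; ties are broken at random."""
    counts = Counter(rfb_hits)
    if not counts:
        return None
    top = max(counts.values())
    leaders = sorted(name for name, n in counts.items() if n == top)
    return random.choice(leaders)


def predict(ipah_detected, rfb_hits, mlst, flic, crispr, shigella_sts=frozenset()):
    """Return (call, match_level) following the rules described in the text."""
    if not ipah_detected:
        return "Not Shigella/EIEC", None          # exclusion checkpoint
    rfb = predominant_rfb(rfb_hits)
    observed = {"mlst": mlst, "flic": flic, "crispr": crispr}
    partial = None
    for profile in META_PROFILES:
        if profile["rfb"] != rfb:
            continue
        agree = sum(observed[key] == profile[key] for key in observed)
        if agree == 3:
            return profile["serotype"], "100 % match"
        if agree == 2 and partial is None:
            partial = profile                      # rfb plus two of three markers
    if partial is not None:
        return partial["serotype"], "75 % match"
    if mlst in shigella_sts:
        return "Shigella spp.", None               # Shigella ST but no profile
    return "EIEC", None                            # ipaH-positive, no profile


print(predict(True, ["rfb_A", "rfb_A", "rfb_B"], "ST_1", "fliC_1", "CRISPR_1"))
print(predict(True, ["rfb_A"], "ST_9", "fliC_1", "CRISPR_1"))
```

Missing data can be represented as `None`: a missing marker simply fails to agree with the profile, so a serotype is still returned when the rfb and the two remaining markers match.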

Evaluation of ShigaPass and other in silico tools with the development dataset

The development dataset consisted of 575 bacterial strains, including 440 Shigella strains (86 S. boydii ; 90 S. dysenteriae ; 139 S. flexneri 1–5, X, and Y; 22 S. flexneri 6; and 103 S. sonnei strains) (Table S1). All Shigella serotypes are represented within this dataset. In addition, 45 EIEC and 90 non- Shigella /EIEC genomes – 70 E. coli , six Escherichia albertii , two Escherichia marmotae , two Campylobacter jejuni , one Enterobacter hormaechei , one Enterococcus faecium , two Klebsiella pneumoniae , three Listeria monocytogenes , two Salmonella enterica and one Staphylococcus aureus – were used as a control group in the development dataset (Table S1).

All non- Shigella isolates were correctly assigned to either EIEC or ‘not Shigella /EIEC’. ShigaPass incorrectly assigned 13 of the 440 Shigella genome sequences (2.9%) (Tables 3 and 4). One S. dysenteriae 16 genome (strain 08–2736) was identified as ‘ S. dysenteriae 3 /S. dysenteriae 16’ because only the rfb region common to both S. dysenteriae 3 and S. dysenteriae 16 was detected. Twelve S. flexneri 1–5, X and Y strains were incorrectly identified. Examination of these genomes revealed the presence of non-functional genes due to IS elements, premature STOP codons or SNPs in POAC genes, potentially accounting for the discrepancies between phenotypic and genomic results. For example, strain NCDC 5797–70 (ERR832473) was phenotypically identified as S. flexneri X, but was predicted to be a 3a strain by genomic analyses, because it carried the gtrX and oac genes. However, the oac gene was found to be disrupted by an IS from the IS3 family. All discrepant results for the two datasets are presented in Table 3. Interestingly, we found that some S. flexneri 3b isolates harboured the oac1b variant rather than oac. As both these variants give positive reactions with S. flexneri group 6 antisera, this phenotype was missed with sero-agglutination methods.

Table 3.

Discrepant results between serotyping and genoserotyping for all Shigella strains

| Shigella cluster | Accession no. | Serotype | Genotype | Explanation |
|---|---|---|---|---|
| S1 | ERR9828686, ERR5984600* | SD16 | SD3/SD16 | Specific rfb k-mers for SD16 not detected |
| | ERR5917462 | SF6 | Shigella spp. | Mixture of SF and SD16 |
| SON | ERR5976096, ERR5970000, ERR5976076 | SS | Shigella spp. | Mixture of SF and SS |
| S3-PG1 | ERR5917460, ERR5917552, ERR5919289 | SF1a | SF1b | Premature stop codon in oac1b |
| | ERR5908871 | SF1a | SF1b | Premature stop codon in oac1b |
| | ERR9828906, ERR9828926 | SF1a | SF1b | Premature stop codon in oac1b |
| | ERR5956238 | SF1a | SF1b | oac1b disrupted by IS1 |
| | ERR5951466 | SF1a | SF1c | gtrIC disrupted by IS1 |
| | ERR5909154, ERR042807* | SF3b | SF1b | gtrB-I disrupted by IS3 |
| | ERR9828728 | SF3b | SF1b | gtrA-I disrupted by IS3 |
| | ERR9828784 | SF3b | SF1b | Premature stop codon in gtrI |
| | ERR5908533*, ERR5908899, ERR5951503, ERR5951556*, ERR5951560, ERR5952709, ERR5952711, ERR9828751 | SF3b | Unknown | New combination: wzx1-5, oac1b |
| | ERR5954494* | SF4a | SF4av | opt-plasmid instability |
| | ERR832454* | SF4av | SF4bv | Mutations of oac |
| | ERR5908858, ERR9828867 | SFX | Unknown | New combination: wzx1-5, gtrIC |
| | ERR5951465, ERR5896801, ERR5954596, ERR9828917 | SFYv | SF4av | Mutations of gtrIV |
| S3-PG2 | ERR5917480, ERR5961583 | SF3a | SF3b | gtrX not well covered |
| | ERR5961537 | SF3a | SF3b | Loss of gtrX |
| | ERR5956351 | SF3a | Unknown | New combination: gtrI, gtrX (mixture of two SF isolates) |
| | ERR5908957 | SF3b | SF3a | gtrB disrupted by IS3 |
| | ERR5956260 | SF3b | SF3a | gtrB disrupted by IS3 |
| | ERR5919332 | SFY | SF3b | oac disrupted by IS3 |
| S3-PG3 | ERR5954491 | SF2a | Shigella spp. | Mixture of SF and SS |
| | ERR5896714 | SF2a | Unknown | New combination: wzx1-5, gtrII, optII |
| | ERR5908853, ERR5908873, ERR5919308, ERR9828720 | SF2a | SFY | gtrII not well covered |
| | ERR5917554 | SF2b | SF2a | gtrX not well covered |
| | ERR5954565 | SF2b | SFX | gtrII not well covered |
| | ERR5908546 | SFX | SFXv | Mutations of optII |
| | ERR5896795 | SFXv | SFYv | gtrX not well covered |
| | ERR5896710 | SFXv | Unknown | New combination: wzx1-5, gtrII, gtrX, optII; mutations of gtrII |
| | ERR5908557, ERR5896688*, ERR5956227 | SFY | SF2a | Mutations of gtrII |
| | ERR5896723* | SFYv | SFXv | Premature stop codon in gtrX |
| S3-PG4 | ERR832473* | SFX | SF3a | oac disrupted by IS3 |
| S3-PG6 | ERR042832*, ERR042833* | SFY | SFX | Mutations of gtrX |
| | ERR5896775, ERR5896763, ERR5896779 | SFYv | SFXv | Mutations of gtrX |
| S3-PG7 | ERR5896701*, ERR5896746, ERR5896747, ERR5896749, ERR5896757, ERR5896761, ERR5896769, ERR5896770, ERR5896771, ERR5896735, ERR5896783 | SF4a | SF4av | opt-plasmid instability |
| | ERR9828705 | SFY | SF4a | gtrB-IV disrupted by IS1 |
| | ERR9828908, ERR5896728* | SFYv | SF4av | No obvious cause for discrepancy detected |

*Strains from the development dataset.

SF, S. flexneri; SS, S. sonnei; SD, S. dysenteriae.

Table 4.

Summary of correct assignment rate for the various in silico Shigella serotype prediction tools, determined with the development dataset strains

| Bacteria | N | ShigaPass v1.5.0 | ShigaTyper v1.0.6 | ShigEiFinder Fasta v1.2.0 | ShigEiFinder Reads v1.2.0 | ShigEiFinder Fasta v1.3.2 | ShigEiFinder Reads v1.3.2 |
|---|---|---|---|---|---|---|---|
| S. boydii | 86 | 86 (100 %) | 60 (69.8 %) | 72 (83.7 %) | 60 (69.8 %) | 76 (88.4 %) | 76 (88.4 %) |
| S. dysenteriae | 90 | 89 (98.9 %) | 64 (71.1 %) | 63 (70 %) | 55 (61.1 %) | 70 (77.8 %) | 71 (78.9 %) |
| S. flexneri 1–5, X, Y | 139 | 127 (91.4 %) | 122 (87.8 %) | 112 (80.6 %) | 94 (67.6 %) | 125 (89.9 %) | 123 (88.5 %) |
| S. flexneri 6 | 22 | 22 (100 %) | 21 (95.5 %) | 21 (95.5 %) | 19 (86.4 %) | 22 (100 %) | 22 (100 %) |
| S. sonnei | 103 | 103 (100 %) | 87 (84.5 %) | 96 (93.2 %) | 97 (94.2 %) | 96 (93.2 %) | 97 (94.2 %) |
| EIEC | 45 | 45 (100 %) | 44 (97.8 %) | 42 (93.3 %) | 37 (82.2 %) | 42 (93.3 %) | 37 (82.2 %) |
| Not Shigella/EIEC | 90 | 90 (100 %) | 88 (97.8 %) | 88 (97.8 %) | 88 (97.8 %) | 88 (97.8 %) | 88 (97.8 %) |
| Total (only Shigella) | 440 | 427 (97 %) | 354 (80.5 %) | 364 (82.7 %) | 325 (73.9 %) | 389 (88.4 %) | 389 (88.4 %) |
| Total (All) | 575 | 562 (97.7 %) | 486 (84.5 %) | 494 (85.9 %) | 450 (78.3 %) | 519 (90.3 %) | 514 (89.4 %) |

ShigaPass results were compared with those of ShigaTyper and ShigEiFinder for all the Shigella and non-Shigella isolates in the development dataset (Table 4, Fig. 2, Tables S8 and S9). ShigaPass gave the best serotype prediction results. The correct assignment rates obtained were 97.7 % (562/575) for ShigaPass, 90.3 % (519/575) for ShigEiFinder version 1.3.2 (assemblies), 89.4 % (514/575) for ShigEiFinder version 1.3.2 (short reads) and 84.5 % (486/575) for ShigaTyper (Table 4). The poorer performances of ShigaTyper and ShigEiFinder relative to ShigaPass were mostly due to the inaccurate typing of several S. boydii serotypes and the new and provisional serotypes of S. dysenteriae (16, 17 and prov. BEDP 02–5104) (Fig. 2). In particular, 14–20 % of the strains from S. boydii 1, 2, 11 and 19 and 100 % of the S. boydii 10 strains were incorrectly assigned, or not assigned at all, by both ShigaTyper and ShigEiFinder. There were also problems with the assignment of some S. boydii 4, 5, 6, 20 and 21 strains, specifically with ShigaTyper (no assignment or incorrect assignment) (Fig. 2, Tables S8 and S9). In addition, ShigaTyper and ShigEiFinder still considered ‘S. boydii 13’ and ‘atypical S. boydii 13’ to be part of the total Shigella population (Tables S8 and S9). However, they are now considered to correspond to E. albertii and attaching and effacing (A/E) E. coli, respectively. As a result, ShigaPass correctly identifies them as ‘not Shigella/EIEC’.

Fig. 2.


Stacked-bar chart displaying the results (percentage assignment) for each tool for each Shigella serotype in the development dataset. S, ShigaPass; T, ShigaTyper [31]; F, ShigEiFinder using SPAdes assemblies [11]; R, ShigEiFinder using short reads [11]. SB, S. boydii ; SD, S. dysenteriae ; SON, S. sonnei ; SF, S. flexneri; EIEC, entero-invasive E. coli; Not, Not Shigella /EIEC. A correct assignment means that the predicted serotype was the same as the phenotypically determined serotype. An uncertain assignment means that the correct serotype was listed among others or, for ShigEiFinder, that only the cluster level (Cs1 to Cs3) was obtained. An incorrect assignment means that the predicted serotype was different from the phenotypically determined serotype. ‘None’ means that no serotype was predicted. More detailed results can be found in Tables S8 and S9. For ShigEiFinder, only results for version 1.3.2 are displayed.

Evaluation of ShigaPass and other in silico tools with the validation dataset

We evaluated the performance of ShigaPass with an independent dataset consisting of 4304 isolates, including 4276 Shigella isolates (85 S. boydii ; 49 S. dysenteriae ; 2393 S. sonnei ; 1616 S. flexneri 1–5, X and Y; 133 S. flexneri 6); 22 EIEC; three E. coli ; and three E. albertii strains (Table S1). All these strains were received and sequenced by the FNRC-ESS between 2017 and 2021, within the framework of the French national surveillance programme for Shigella infections and were characterized with biochemical tests and sero-agglutination assays. Eight Shigella strains could not be identified to serotype level in biochemical tests or seroagglutination assays because they were rough or non-serotypable. Two strains were classified as non-serotypable via seroagglutination due to a lack of sera against S. dysenteriae 14 and S. dysenteriae 17 in our laboratory since 2021. ShigaPass identified these two strains as S. dysenteriae 14 and S. dysenteriae 17, with a 100 % match. ShigaPass also efficiently identified all rough strains (Supplementary Data 1). However, these strains were not included in the analysis of correct assignment rates for in silico tools because it was not possible to compare their assignment with that achieved with laboratory methods. Following analysis of the validation dataset, we updated the meta-profile list with 65 new profiles, mostly reflecting ST variability (37/65, 56.9%) or new combinations comprising new meta-profiles (https://github.com/imanyass/ShigaPass/blob/main/SCRIPT/ShigaPass_DataBases/ShigaPass_meta_profiles_v5.csv). There was a 75 % match between these new profiles and the initial meta-profiles (matches for rfb and two of the other three genetic features) so even before the addition of these new profiles, it was still possible to identify genomes containing these new profiles correctly with ShigaPass.

The sensitivity rates were 100 % (4296/4296) for ShigaPass and ShigEiFinder (reads and assemblies) and 99.1 % (4258/4296) for ShigaTyper, which gave no prediction for 38 Shigella strains (Table 5, Tables S10 and S11). The rates of correct assignment were 98.5 % (4233/4296) for ShigaPass, 96.5 % (4147/4296) for ShigaTyper, 94 % (4039/4296) for ShigEiFinder version 1.3.2 (assemblies), and 92.9 % (3990/4296) for ShigEiFinder version 1.3.2 (short reads) (Table 5, Fig. 3, Tables S10 and S11). The S. sonnei genomes were correctly predicted at a rate exceeding 99 % (≥2374/2393) by all the tested tools (Table 5). ShigaPass gave the best serotype prediction for S. boydii (100 %, 85/85), S. dysenteriae (97.7 %, 42/43) and S. flexneri serotype 6 (99.2 %, 132/133), with lower values obtained for ShigaTyper [S. boydii, 81.2 % (69/85); S. dysenteriae, 53.5 % (23/43); S. flexneri serotype 6, 96.2 % (128/133)], ShigEiFinder version 1.3.2 (assemblies) [S. boydii, 71.8 % (61/85); S. dysenteriae, 60.5 % (26/43); S. flexneri serotype 6, 85.7 % (114/133)] and ShigEiFinder version 1.3.2 (short reads) [S. boydii, 43.5 % (37/85); S. dysenteriae, 34.9 % (15/43); S. flexneri serotype 6, 50.4 % (67/133)] (Table 5). Indeed, 8–12 % of the strains from S. boydii 2, 4 and 11, or S. dysenteriae 2; 20–100 % of S. boydii 10, and S. dysenteriae 3 or 16; and 100 % of S. dysenteriae 17 and prov. BEDP 02–5104 strains were incorrectly assigned by both ShigaTyper and ShigEiFinder. With this larger dataset, ShigEiFinder performed less well than ShigaTyper due to problems in the identification of S. boydii 1, 8, 14, 18, 19, 20 and 22, and S. dysenteriae 4, 6 and 12 (Fig. 3, Tables S10 and S11). The rate of correct identification for S. flexneri subserotypes was similar for ShigaPass (96.4 %, 1556/1614) and ShigaTyper (94.8 %, 1530/1614), whereas ShigEiFinder version 1.3.2 performed less well (91.1 %, 1470/1614 with short reads and 88.7 %, 1431/1614 with assemblies) (Table 5).
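The two metrics used above are simple proportions: sensitivity is the fraction of genomes for which a tool returns any prediction, and the correct assignment rate is the fraction for which the prediction agrees with the phenotypic serotype. The short Python sketch below recomputes some of the reported percentages from the counts given in the text; the helper name `rate` is ours and rounding to one decimal place is an assumption about how the figures were reported.

```python
def rate(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


TOTAL = 4296  # validation genomes with a conventional serotyping result

# Sensitivity: ShigaTyper gave no prediction for 38 strains.
shigatyper_sensitivity = rate(TOTAL - 38, TOTAL)

# Correct assignment rates, using the counts quoted in the text.
correct_counts = {
    "ShigaPass": 4233,
    "ShigaTyper": 4147,
    "ShigEiFinder v1.3.2 (assemblies)": 4039,
    "ShigEiFinder v1.3.2 (short reads)": 3990,
}
correct_rates = {tool: rate(n, TOTAL) for tool, n in correct_counts.items()}

print(shigatyper_sensitivity)
for tool, value in correct_rates.items():
    print(tool, value)
```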

Table 5.

Summary of correct assignment rate for the various in silico Shigella serotype prediction tools, determined with the validation dataset strains*

| Bacteria | N | ShigaPass v1.5.0 | ShigaPass (EnteroBase assemblies) | ShigaTyper v1.0.6 | ShigEiFinder Fasta v1.2.0 | ShigEiFinder Reads v1.2.0 | ShigEiFinder Fasta v1.3.2 | ShigEiFinder Reads v1.3.2 |
|---|---|---|---|---|---|---|---|---|
| S. boydii | 85 | 85 (100 %) | 56 (65.9 %) | 69 (81.2 %) | 61 (71.8 %) | 1 (1.2 %) | 61 (71.8 %) | 37 (43.5 %) |
| S. dysenteriae | 43 | 42 (97.7 %) | 32 (74.4 %) | 23 (53.5 %) | 20 (46.5 %) | 2 (4.7 %) | 26 (60.5 %) | 15 (34.9 %) |
| S. flexneri 1–5, X, Y | 1614 | 1556 (96.4 %) | 1265 (78.4 %) | 1530 (94.8 %) | 881 (54.6 %) | 2 (0.1 %) | 1431 (88.7 %) | 1470 (91.1 %) |
| S. flexneri 6 | 133 | 132 (99.2 %) | 72 (54.1 %) | 128 (96.2 %) | 114 (85.7 %) | 5 (3.8 %) | 114 (85.7 %) | 67 (50.4 %) |
| S. sonnei | 2393 | 2390 (99.9 %) | 2393 (100 %) | 2374 (99.2 %) | 2383 (99.6 %) | 2385 (99.7 %) | 2383 (99.6 %) | 2384 (99.6 %) |
| EIEC | 22 | 22 (100 %) | 22 (100 %) | 20 (90.9 %) | 21 (95.5 %) | 9 (40.9 %) | 21 (95.5 %) | 13 (59.1 %) |
| Not Shigella/EIEC | 6 | 6 (100 %) | 6 (100 %) | 3 (50 %) | 3 (50 %) | 4 (66.7 %) | 3 (50 %) | 4 (66.7 %) |
| Total (only Shigella) | 4268 | 4205 (98.5 %) | 3818 (89.5 %) | 4124 (96.6 %) | 3459 (81 %) | 2395 (56.1 %) | 4015 (94.1 %) | 3973 (93.1 %) |
| Total (All) | 4296 | 4233 (98.5 %) | 3846 (89.5 %) | 4147 (96.5 %) | 3483 (81.1 %) | 2408 (56.1 %) | 4039 (94 %) | 3990 (92.9 %) |

*Except for eight Shigella strains (of 4276) with an undesignated serotype with the conventional serotyping methods (biochemical tests and slide agglutination assays); EIEC, entero-invasive E. coli.

Fig. 3.


Stacked-bar chart displaying the results (percentage assignment) for each tool for each Shigella serotype in the validation dataset. S, ShigaPass; T, ShigaTyper [31]; F, ShigEiFinder using SPAdes assemblies [11]; R, ShigEiFinder using short reads [11]. SB, S. boydii ; SD, S. dysenteriae ; SON, S. sonnei ; SF, S. flexneri . A correct assignment means that the predicted serotype was the same as the phenotypically determined serotype. An uncertain assignment means that the correct serotype was listed among others or, for ShigEiFinder, that only the cluster level (Cs1 to Cs3) was obtained. An incorrect assignment means that the predicted serotype was different from the phenotypically determined serotype. ‘None’ means that no serotype was predicted. More detailed results can be found in Tables S10 and S11. For ShigEiFinder, only results for version 1.3.2 are displayed.

All non- Shigella isolates were correctly identified by ShigaPass, and none of the Shigella isolates was misidentified as non- Shigella , highlighting the robustness of the genetic structures used as markers to differentiate between Shigella and non- Shigella isolates. By contrast, 18 Shigella isolates were misidentified as EIEC by ShigaTyper. Eight and 138 Shigella isolates could not be resolved (results output as Shigella /EIEC unclustered) by ShigEiFinder version 1.3.2, based on assemblies and short reads, respectively (Tables S10 and S11).

The validation dataset included 63 strains incorrectly assigned by ShigaPass (Table 3). One S. dysenteriae 16 strain was identified as ‘S. dysenteriae 3/S. dysenteriae 16’ due to loss of the genomic island containing the second, complete rfb. Three S. sonnei, one S. flexneri 6 and one S. flexneri 2a strain were identified as ‘Shigella spp.’ because the genomic sequences were contaminated with at least two different serotypes. The 57 remaining discrepancies were due to differences between the phenotypic and genotypic signatures encountered in S. flexneri 1–5, X and Y (Table 3). In addition to some of the prophage genes being non-functional, some prophage genes seemed to have been lost by some strains during subculture. For example, isolate 201904879 (ERR5961537) was phenotypically identified as S. flexneri 3a, implying that it carried both the oac and gtrX genes, but only the oac gene was detected by ShigaPass, resulting in a genomic classification as 3b. Reagglutination was performed on several colonies from a stock culture. Some colonies were serotyped as S. flexneri 3a and others as S. flexneri 3b, suggesting an instability of the gtrX prophage or a mixture of these two populations. Another issue was frequently encountered with S. flexneri strains harbouring the plasmid-borne opt gene responsible for PEtN modification of the O-antigen. This PEtN modification confers positive agglutination with MASF IV-1 serum, which was not systematically used before 2018. Many strains were genomically identified as S. flexneri 4av (positive agglutination with MASF IV-1) but were phenotypically recognized as S. flexneri 4a (negative agglutination with MASF IV-1) when sero-agglutination was repeated, probably due to loss of the plasmid-borne opt gene.
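The examples above (3a versus 3b, 4a versus 4av) show how a genomic call follows from the set of O-antigen modification genes detected, and how a disrupted or lost gene produces a discrepancy with the phenotype. The Python sketch below illustrates this with a deliberately partial lookup table limited to combinations mentioned in this article and in Table 3. It is not the ShigaPass database, and the `call_serotype` and `compare` helpers are hypothetical.

```python
# Partial, illustrative mapping of POAC gene sets to S. flexneri serotypes.
SEROTYPE_BY_GENES = {
    frozenset(): "Y",
    frozenset({"gtrX"}): "X",
    frozenset({"oac"}): "3b",
    frozenset({"oac", "gtrX"}): "3a",
    frozenset({"gtrIV"}): "4a",
    frozenset({"gtrIV", "opt"}): "4av",
}


def call_serotype(genes):
    """Return the serotype for a gene set, or 'Unknown' for an unlisted set."""
    return SEROTYPE_BY_GENES.get(frozenset(genes), "Unknown")


def compare(detected, disrupted=()):
    """Return (genomic call, expected phenotype).

    The genomic call uses every detected gene; the expected phenotype
    ignores genes flagged as disrupted (IS insertion, premature stop codon).
    """
    genomic = call_serotype(detected)
    phenotype = call_serotype(set(detected) - set(disrupted))
    return genomic, phenotype


# gtrX and oac detected, but oac disrupted by an IS: genomic 3a, phenotype X.
print(compare({"gtrX", "oac"}, disrupted={"oac"}))
# Loss of gtrX from a 3a strain leaves only oac: genomic 3b.
print(call_serotype({"oac"}))
```

A gene combination absent from the table, such as gtrI with gtrX, returns 'Unknown', which mirrors the "New combination" rows of Table 3.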

Most of the false-negative and false-positive results obtained were due to poor target gene coverage, or the presence of a mixture of multiple Shigella isolates. The AT-rich rfb gene cluster and POAC genes are not well covered by transposase-based libraries (Nextera XT kit) such as those generated by our public health sequencing platform [7]. All validation strains were sequenced by this platform. Consequently, when we used SPAdes assemblies produced by EnteroBase, which filters out low-coverage contigs, the rate of correct assignment dropped from 98.5 % (4233/4296) to 89.5 % (3846/4296) (Table 5).
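One way to check whether a coverage filter might discard contigs carrying rfb or POAC genes is to inspect the contig headers written by SPAdes, which encode length and k-mer coverage as `NODE_<n>_length_<len>_cov_<cov>`. The Python sketch below lists the contigs that a length and coverage filter would remove. The thresholds are hypothetical defaults chosen for illustration and are not the settings used by EnteroBase.

```python
import re

# SPAdes names contigs like "NODE_12_length_3456_cov_7.89".
SPADES_HEADER = re.compile(r"NODE_\d+_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_spades_header(header):
    """Return (length, coverage) from a SPAdes contig header, or None."""
    match = SPADES_HEADER.search(header)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def contigs_removed_by_filter(headers, min_length=500, min_coverage=5.0):
    """List headers that a length/coverage filter would discard.

    Both thresholds are hypothetical, not EnteroBase's actual settings.
    """
    removed = []
    for header in headers:
        parsed = parse_spades_header(header)
        if parsed is None:
            continue  # not a SPAdes-style header; leave it alone
        length, coverage = parsed
        if length < min_length or coverage < min_coverage:
            removed.append(header)
    return removed


headers = [
    ">NODE_1_length_250000_cov_35.6",
    ">NODE_87_length_1200_cov_2.1",    # e.g. a poorly covered AT-rich region
    ">NODE_140_length_310_cov_40.0",
]
for header in contigs_removed_by_filter(headers):
    print(header)
```

Running the serotype prediction on both the filtered and unfiltered contig sets would show whether the discarded contigs carried any of the target genes.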

The mean execution time per genome for ShigaPass was approximately 12±1 s with four processing cores and 4 GB of memory, versus 1.5 s for ShigEiFinder (using assemblies). The speed of ShigaPass analysis depends on the computational resources available and is limited principally by the BLAST queries and the number of central processing unit (CPU) threads.

Discussion

We developed ShigaPass, a new in silico tool for Shigella genoserotyping, based on the examination and characterisation of rfb, fliC and POAC genes, CRISPR loci and MLST STs (Achtman’s scheme) from the 440 Shigella genomes of the development dataset. The combination of these different genetic structures robustly recognized the different Shigella serotypes and differentiated between Shigella and non- Shigella isolates.

We compared the performance of ShigaPass with that of ShigaTyper and ShigEiFinder, two in silico tools recently described for Shigella genoserotyping. ShigaPass correctly identified 98.5 % (4233/4296) of the strains used in the validation dataset with a sensitivity of 100 %, whereas the correct assignment rates of the other tools were 96.5 % (4147/4296) for ShigaTyper, 94 % (4039/4296) for ShigEiFinder version 1.3.2 (assemblies) and 92.9 % (3990/4296) for ShigEiFinder version 1.3.2 (short reads) (Table 5). ShigaTyper was designed to predict Shigella serotypes via a mapping approach, using Illumina or Oxford Nanopore reads for only two genes of the O-antigen rfb cluster, wzx and wzy [31]. The wzx/wzy genes, which are key components for O-antigen synthesis, are serotype-specific and frequently used in PCR-based assays for serotype identification [50]. However, with the use of only two genes, genoserotype identification is prone to failure in cases of poor read coverage. ShigaTyper can also differentiate between Shigella and EIEC on the basis of a set of criteria, including, in particular, detection of the lactose permease (lacY) and lysine decarboxylase (cadA) genes. These two metabolic genes are highly variable in Shigella and EIEC, and are, therefore, not optimal for differentiating between species [10, 31, 52]. By contrast, the MLST used in ShigaPass is a standard tool in population genetics. ShigEiFinder accepts raw reads or assembled genomes, and uses the Shigella O-antigen genes database created by ShigaTyper (based on wzx and wzy genes only) for Shigella prediction, the database of E. coli O- and H-antigen genes of SerotypeFinder for EIEC prediction, and a set of 22 accessory genome genes to detect Shigella/EIEC clusters [11].

In a previous study, we evaluated the performance of ShigEiFinder version 1.2.0. We found that this tool was not optimal for predicting Shigella serotypes, particularly if the prediction was based on short reads [7]. Here, we evaluated the new version of ShigEiFinder (version 1.3.2), released in April 2022. Initially, ShigEiFinder (short reads) uses the same coverage threshold (50 %) as ShigaTyper for the detection of wzx/wzy genes, but ShigEiFinder also uses the ratio of the mean mapping depth of wzx/wzy genes to the average mean mapping depth of seven HK genes (O/7HK ratio) to detect and eliminate low-level contamination or suboptimal sequencing data. In ShigEiFinder version 1.3.2, the O/7HK ratio cutoff was adjusted from 10 % to 1 % (short reads). The combination of POAC genes used to predict S. flexneri serotypes (short reads and assemblies) was also modified. These changes increased the rates of correct Shigella serotype prediction for the validation dataset from 56.1 % (2395/4268) to 93.1 % (3973/4268) for short reads and from 81 % (3459/4268) to 94.1 % (4015/4268) for assemblies (Table 5).
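The O/7HK ratio described above can be written as a short calculation. The Python sketch below is our own illustration of the arithmetic and is not taken from the ShigEiFinder source code: the function names and the depth values are hypothetical, and treating a ratio equal to the cutoff as passing is an assumption.

```python
from statistics import mean


def o_to_7hk_ratio(o_gene_depths, hk_gene_depths):
    """Ratio of the mean wzx/wzy mapping depth to the mean depth of the HK genes."""
    return mean(o_gene_depths) / mean(hk_gene_depths)


def passes_cutoff(ratio, cutoff):
    """Return True if the O-antigen gene signal is kept at this cutoff."""
    return ratio >= cutoff


# Hypothetical mean mapping depths: poorly covered wzx and wzy genes
# against seven well-covered housekeeping genes.
o_depths = [2.0, 4.0]
hk_depths = [60.0] * 7

ratio = o_to_7hk_ratio(o_depths, hk_depths)
print(ratio)
print(passes_cutoff(ratio, 0.10))  # 10 % cutoff
print(passes_cutoff(ratio, 0.01))  # 1 % cutoff
```

With these illustrative numbers, O-antigen genes covered at 5 % of the housekeeping depth would be rejected at a 10 % cutoff but retained at 1 %, which is consistent with the improvement reported for libraries that under-represent the rfb region.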

ShigaPass outperforms ShigaTyper and ShigEiFinder, particularly for the identification of S. boydii and S. dysenteriae serotypes. The phenotypic identification of S. boydii and S. dysenteriae serotypes by conventional serotyping methods is especially challenging and laborious, requiring more than 40 antisera. This highlights the utility of ShigaPass as an in silico tool for the effective surveillance of these serotypes. The better overall performance of ShigaPass than of ShigaTyper and ShigEiFinder can be explained by multiple factors, including: (i) the use of an optimised rfb database covering all Shigella serogroups and serotypes, including, in particular, the new and provisional serotypes; (ii), the use of multiple genetic structures for identifying Shigella serotypes and differentiating between Shigella and EIEC; and (iii) the use of k-mers for the whole rfb cluster (excluding regions common to different serotypes) rather than just the wzx/wzy genes, thereby increasing the chances of serotype identification if the genomic sequences obtained do not cover the rfb cluster very well.

Interestingly, neither ShigaTyper nor ShigEiFinder was able to differentiate accurately between S. boydii 1 and S. boydii 20 or between S. boydii 6 and S. boydii 10 or to identify S. dysenteriae 16, 17 and prov. BEDP 02–5104, the newly identified serotypes. In our previous study, we provided insight into the representation and characterization of rfb in all Shigella serotypes, including the provisional serotypes [7]. We showed that S. boydii 1 and S. boydii 20 carry the same rfb, but that S. boydii 20 is different due to the presence of an IS at the rfb locus and the absence of the galF gene (Table S3) [7]. We therefore used and validated the galF gene as a marker for differentiating between these two serotypes. ShigaTyper uses the heparinase gene to differentiate between S. boydii 1 and S. boydii 20, whereas ShigEiFinder uses a gene encoding a hypothetical protein carried by a plasmid present in S. boydii 20 [11, 31]. We found that these two genes were unable to distinguish reliably between S. boydii 1 and S. boydii 20 sequences in some cases, particularly if the plasmid was lost during subculture. It is more difficult to differentiate between S. boydii 6 and S. boydii 10, because their rfb sequences differ only by the insertion of an IS into the wbaM gene of S. boydii 6. ShigaTyper determines whether the wbaM locus is disrupted by an IS, and ShigEiFinder relies on the detection of a single SNP in each of the wzx and wzy genes, but neither method can reliably distinguish between S. boydii 6 and S. boydii 10. A comparison of the genomes of S. boydii 6 and S. boydii 10 revealed the presence of a tauA gene on the chromosome of S. boydii 6, but not on that of S. boydii 10. However, caution is required in the interpretation of these results, because S. boydii 6 is an extremely rare serotype, and only one S. boydii 6 genome was used in our study. Finally, the lack of inclusion of rfb sequences from the new serotypes (S. dysenteriae 16, 17 and prov. BEDP 02–5104) in the ShigaTyper and ShigEiFinder databases is probably the main reason for their inability to identify these new serotypes.

S. flexneri serotyping is laborious and relies on a panel of 12 antisera raised against specific type and group factors. Some new antisera are not commercially available, and both this and the cross-reactivity observed with some serotypes complicate the serotyping of S. flexneri 1–5, X and Y. Serotype conversion, mediated by prophages and plasmids carrying O-antigen-modifying genes, is frequent and well documented in S. flexneri from the S3 cluster [7, 20, 27, 53]. The most common causes of discrepant results between phenotypic and genotypic methods are the instability of these mobile genetic elements and the presence of phenotypically inactive POAC genes with multiple and varied insertions, deletions or mutations [20, 21, 28, 53]. The discrepant results between the phenotypic data and those generated by ShigaPass were, therefore, not unexpected, but corresponded to only 3.9 % (69/1753) of the S. flexneri 1–5, X and Y serotypes and subserotypes used in this study (Table 3). However, as these serotypes and subserotypes are not correlated with the true population structure of S. flexneri as determined by genomic methods [2], it is better for laboratory-based surveillance purposes to categorize strains into the seven S. flexneri phylogenetic groups (PGs) described by Connor and co-workers, which can easily be identified by cgMLST [2, 7], rather than relying exclusively on S. flexneri serotypes and subserotypes [7].

Assembly- and BLAST-based methods depend heavily on the sequencing and assembly methods used and on their quality, so the results obtained with them do not necessarily provide unequivocal evidence of the performance of ShigaPass. False-negative results may arise in cases of missing data. The rfb and POAC genes have AT-rich sequences. Most of our sequences were obtained with the Nextera XT kit, which is known to be biased against AT-rich regions. In addition, the sequencing protocol used by our public health multi-pathogen sequencing platform may also be biased against such regions. Overall, this results in assemblies with short contigs and poorer coverage of these AT-rich regions. ShigaPass can overcome this problem with the k-mer strategy. However, we found that the accuracy of serotype prediction was lower when EnteroBase SPAdes assemblies from our validation dataset were used. This decrease in accuracy led to false-negative results, mainly because the assembly method used by EnteroBase is stringent and removes low-coverage contigs, including those containing the rfb and POAC genes [33]. It is therefore important, for optimal ShigaPass performance, not to filter out short and low-coverage contigs (especially when using sequencing protocols biased against AT-rich regions). In this study, we used only SPAdes, a widely used assembler, to generate the assemblies, with no evaluation of other assemblers. However, the short reads from our development and validation datasets could be used to evaluate the performance of ShigaPass with different assemblers if necessary.

False-positive results may arise due to mixtures of two or more Shigella genomes. In this case, ShigaPass issues a comment mentioning the presence of multiple rfb genes. Furthermore, if the main rfb detected (the one with the most hits) is not consistent with other markers, the output of ShigaPass is ‘Shigella spp.’, which may be considered a marker of genome contamination. However, such contamination may pass unnoticed, particularly in cases of mixture with different subserotypes of S. flexneri 1–5, X and Y strains. An output of ‘Shigella spp.’ may also be obtained with genomes of poor quality, or in cases of new emerging serotypes.
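The handling of multiple or inconsistent rfb signals described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the ShigaPass code: the function name, the input structures (a mapping of rfb type to hit count and a set of rfb types compatible with the other markers) and the comment strings are hypothetical, and the treatment of an assembly with no rfb hit at all is an assumption of the sketch.

```python
from typing import Dict, List, Set, Tuple


def call_from_rfb_hits(
    rfb_hits: Dict[str, int],
    consistent_rfb: Set[str],
) -> Tuple[str, List[str]]:
    """Return (call, comments) from rfb hit counts.

    rfb_hits: hypothetical mapping of rfb type -> number of hits.
    consistent_rfb: rfb types compatible with the other markers
        (for example MLST, fliC and CRISPR), computed elsewhere.
    """
    comments: List[str] = []
    if not rfb_hits:
        # Assumption of this sketch: no rfb hit yields the generic call.
        return "Shigella spp.", ["no rfb detected"]

    # The main rfb is the one with the most hits.
    main_rfb = max(rfb_hits, key=lambda rfb: rfb_hits[rfb])

    # More than one rfb type may indicate a mixture of genomes.
    if len(rfb_hits) > 1:
        comments.append("multiple rfb detected")

    # A main rfb that disagrees with the other markers gives the generic call.
    if main_rfb not in consistent_rfb:
        return "Shigella spp.", comments
    return main_rfb, comments


if __name__ == "__main__":
    print(call_from_rfb_hits({"SB2": 40, "SD3": 5}, {"SB2"}))
    print(call_from_rfb_hits({"SB2": 40}, {"SD3"}))
```

In this sketch a mixture is flagged only when the rfb types differ, which mirrors the limitation noted above: mixtures of S. flexneri 1–5, X and Y subserotypes would share the same rfb and would not be flagged.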

In conclusion, ShigaPass is a user-friendly tool developed for the prediction of all Shigella serotypes from assembled genomes and for differentiating between Shigella and non-Shigella isolates based on the detection of ipaH, rfb, fliC and POAC genes, and MLST ST7 and CRISPR profiles. We show here that ShigaPass gives better prediction rates than other existing in silico tools, correctly identifying 98.5 % of 4268 Shigella genomes. The FNRC-ESS has been using ShigaPass for the routine surveillance of Shigella infections in France since October 2021, and ShigaPass is available from GitHub for public use and from Galaxy as a user-friendly web-accessible tool. Finally, we strongly recommend combining ShigaPass with cgMLST data, to optimize the laboratory and epidemiological surveillance of Shigella.

Supplementary Data

Supplementary material 1
Supplementary material 2

Funding information

This work was supported by Institut Pasteur, Santé Publique France and the Fondation Le Roch-Les Mousquetaires. E.H. was funded by a Fulbright Research Award. The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript.

Author contributions

Conceptualization: F.X.W., I.Y.; Methodology: I.Y., E.H., F.X.W.; Investigation and Resources: I.Y., S.L., C.R., I.C., M.L.C., L.F., M.P.G., F.X.W.; Validation: I.Y., S.L., F.X.W.; Data curation: I.Y., A.S., F.X.W.; Writing – original draft preparation: I.Y., F.X.W.; Writing – review and editing: all the authors.

Conflicts of interest

The authors declare that they have no conflicts of interest.

Footnotes

Abbreviations: A/E, attaching and effacing; cgMLST, core genome multilocus sequence typing; CPU, central processing unit; CRISPR, clustered regularly interspaced short palindromic repeats; DR, direct repeat; ECOR, E. coli reference; EIEC, entero-invasive E. coli; FNRC-ESS, French National Reference Centre for E. coli, Shigella and Salmonella; GB, gigabyte; GPL3, GNU General Public License v3; HC, hierarchical cluster; HK, housekeeping; ipaH, invasion plasmid antigen H; IS, insertion sequence; MLST, multilocus sequence typing; MSM, men who have sex with men; nt, nucleotide; PCR, polymerase chain reaction; PEtN, phosphoethanolamine; PG, phylogenetic group; pINV, virulence plasmid; P2M, Mutualized Platform for Microbiology; POAC, plasmid-encoded O-antigen modification; prov., provisional; RFLP, restriction fragment length polymorphism; SNP, single nucleotide polymorphism; ST, sequence type; WGS, whole genome sequencing.

All supporting data, code and protocols have been provided within the article or through supplementary data files. One supplementary figure and thirteen supplementary tables are available with the online version of this article.

References

  • 1. Khalil IA, Troeger C, Blacker BF, Rao PC, Brown A, et al. Morbidity and mortality due to Shigella and enterotoxigenic Escherichia coli diarrhoea: the Global Burden of Disease Study 1990-2016. Lancet Infect Dis. 2018;18:1229–1240. doi: 10.1016/S1473-3099(18)30475-4.
  • 2. Connor TR, Barker CR, Baker KS, Weill F-X, Talukder KA, et al. Species-wide whole genome sequencing reveals historical global spread and recent local persistence in Shigella flexneri. Elife. 2015;4:e07335. doi: 10.7554/eLife.07335.
  • 3. Baker KS, Dallman TJ, Ashton PM, Day M, Hughes G, et al. Intercontinental dissemination of azithromycin-resistant shigellosis through sexual transmission: a cross-sectional study. Lancet Infect Dis. 2015;15:913–921. doi: 10.1016/S1473-3099(15)00002-X.
  • 4. Hawkey J, Paranagama K, Baker KS, Bengtsson RJ, Weill F-X, et al. Global population structure and genotyping framework for genomic surveillance of the major dysentery pathogen, Shigella sonnei. Nat Commun. 2021;12:2684. doi: 10.1038/s41467-021-22700-4.
  • 5. Baker KS, Dallman TJ, Behar A, Weill F-X, Gouali M, et al. Travel- and community-based transmission of multidrug-resistant Shigella sonnei lineage among International Orthodox Jewish Communities. Emerg Infect Dis. 2016;22:1545–1553. doi: 10.3201/eid2209.151953.
  • 6. Ewing WH. Shigella nomenclature. J Bacteriol. 1949;57:633–638. doi: 10.1128/jb.57.6.633-638.1949.
  • 7. Yassine I, Lefèvre S, Hansen EE, Ruckly C, Carle I, et al. Population structure analysis and laboratory monitoring of Shigella by core-genome multilocus sequence typing. Nat Commun. 2022;13:551. doi: 10.1038/s41467-022-28121-1.
  • 8. Edwards PR, Ewing WH. Identification of Enterobacteriaceae. Burgess Publishing Company; 1972. Genus Shigella; pp. 108–142.
  • 9. Pupo GM, Lan R, Reeves PR. Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. Proc Natl Acad Sci. 2000;97:10567–10572. doi: 10.1073/pnas.180094797.
  • 10. Pettengill EA, Pettengill JB, Binet R. Phylogenetic analyses of Shigella and enteroinvasive Escherichia coli for the identification of molecular epidemiological markers: whole-genome comparative analysis does not support distinct genera designation. Front Microbiol. 2015;6:1573. doi: 10.3389/fmicb.2015.01573.
  • 11. Zhang X, Payne M, Nguyen T, Kaur S, Lan R. Cluster-specific gene markers enhance Shigella and enteroinvasive Escherichia coli in silico serotyping. Microb Genom. 2021;7. doi: 10.1099/mgen.0.000704.
  • 12. Pasqua M, Michelacci V, Di Martino ML, Tozzoli R, Grossi M, et al. The intriguing evolutionary journey of Enteroinvasive E. coli (EIEC) toward pathogenicity. Front Microbiol. 2017;8:2390. doi: 10.3389/fmicb.2017.02390.
  • 13. Parsot C. Shigella spp. and enteroinvasive Escherichia coli pathogenicity factors. FEMS Microbiol Lett. 2005;252:11–18. doi: 10.1016/j.femsle.2005.08.046.
  • 14. The HC, Thanh DP, Holt KE, Thomson NR, Baker S. The genomic signatures of Shigella evolution, adaptation and geographical spread. Nat Rev Microbiol. 2016;14:235–250. doi: 10.1038/nrmicro.2016.10.
  • 15. Edwards PR, Ewing WH. Bergey’s Manual of Systematic Bacteriology. New York, N.Y: Elsevier Publishing Co; 1986. Edwards and Ewing’s identification of Enterobacteriaceae.
  • 16. Chattaway MA, Schaefer U, Tewolde R, Dallman TJ, Jenkins C. Identification of Escherichia coli and Shigella species from whole-genome sequences. J Clin Microbiol. 2017;55:616–623. doi: 10.1128/JCM.01790-16.
  • 17. Lefebvre J, Gosselin F, Ismaïl J, Lorange M, Lior H, et al. Evaluation of commercial antisera for Shigella serogrouping. J Clin Microbiol. 1995;33:1997–2001. doi: 10.1128/jcm.33.8.1997-2001.1995.
  • 18. Coimbra RS, Lefevre M, Grimont F, Grimont PA. Clonal relationships among Shigella serotypes suggested by cryptic flagellin gene polymorphism. J Clin Microbiol. 2001;39:670–674. doi: 10.1128/JCM.39.2.670-674.2001.
  • 19. Coimbra RS, Grimont F, Grimont PA. Identification of Shigella serotypes by restriction of amplified O-antigen gene cluster. Res Microbiol. 1999;150:543–553. doi: 10.1016/s0923-2508(99)00103-5.
  • 20. Brengi SP, Sun Q, Bolaños H, Duarte F, Jenkins C, et al. PCR-based method for Shigella flexneri serotyping: international multicenter validation. J Clin Microbiol. 2019;57:e01592-18. doi: 10.1128/JCM.01592-18.
  • 21. Sun Q, Lan R, Wang Y, Zhao A, Zhang S, et al. Development of a multiplex PCR assay targeting O-antigen modification genes for molecular serotyping of Shigella flexneri. J Clin Microbiol. 2011;49:3766–3770. doi: 10.1128/JCM.01259-11.
  • 22. Coimbra RS, Grimont F, Lenormand P, Burguière P, Beutin L, et al. Identification of Escherichia coli O-serogroups by restriction of the amplified O-antigen gene cluster (rfb-RFLP). Res Microbiol. 2000;151:639–654. doi: 10.1016/s0923-2508(00)00134-0.
  • 23. Sun Q, Lan R, Wang J, Xia S, Wang Y, et al. Identification and characterization of a novel Shigella flexneri serotype Yv in China. PLoS One. 2013;8:e70238. doi: 10.1371/journal.pone.0070238.
  • 24. Sun Q, Lan R, Wang Y, Wang J, Wang Y, et al. Isolation and genomic characterization of SfI, a serotype-converting bacteriophage of Shigella flexneri. BMC Microbiol. 2013;13:39. doi: 10.1186/1471-2180-13-39.
  • 25. Allison GE, Verma NK. Serotype-converting bacteriophages and O-antigen modification in Shigella flexneri. Trends Microbiol. 2000;8:17–23. doi: 10.1016/s0966-842x(99)01646-7.
  • 26. Clark CA, Beltrame J, Manning PA. The oac gene encoding a lipopolysaccharide O-antigen acetylase maps adjacent to the integrase-encoding gene on the genome of Shigella flexneri bacteriophage Sf6. Gene. 1991;107:43–52. doi: 10.1016/0378-1119(91)90295-m.
  • 27. Gentle A, Ashton PM, Dallman TJ, Jenkins C. Evaluation of molecular methods for serotyping Shigella flexneri. J Clin Microbiol. 2016;54:1456–1461. doi: 10.1128/JCM.03386-15.
  • 28. Liu J, Pholwat S, Zhang J, Taniuchi M, Haque R, et al. Evaluation of molecular serotyping assays for Shigella flexneri directly on stool samples. J Clin Microbiol. 2021;59:e02455-20. doi: 10.1128/JCM.02455-20.
  • 29. Williamson D, Ingle D, Howden B. Extensively drug-resistant Shigellosis in Australia among Men Who Have Sex with Men. N Engl J Med. 2019;381:2477–2479. doi: 10.1056/NEJMc1910648.
  • 30. Charles H, Prochazka M, Thorley K, Crewdson A, Greig DR, et al. Outbreak of sexually transmitted, extensively drug-resistant Shigella sonnei in the UK, 2021-22: a descriptive epidemiological study. Lancet Infect Dis. 2022;22:1503–1510. doi: 10.1016/S1473-3099(22)00370-X.
  • 31. Wu Y, Lau HK, Lee T, Lau DK, Payne J. In silico serotyping based on whole-genome sequencing improves the accuracy of Shigella identification. Appl Environ Microbiol. 2019;85:e00165–19. doi: 10.1128/AEM.00165-19.
  • 32. Wirth T, Falush D, Lan R, Colles F, Mensa P, et al. Sex and virulence in Escherichia coli: an evolutionary perspective. Mol Microbiol. 2006;60:1136–1151. doi: 10.1111/j.1365-2958.2006.05172.x.
  • 33. Zhou Z, Alikhan N-F, Mohamed K, Fan Y, the Agama Study Group, et al. The EnteroBase user’s guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity. Genome Res. 2020;30:138–152. doi: 10.1101/gr.251678.119.
  • 34. Lan R, Alles MC, Donohoe K, Martinez MB, Reeves PR. Molecular evolutionary relationships of enteroinvasive Escherichia coli and Shigella spp. Infect Immun. 2004;72:5080–5088. doi: 10.1128/IAI.72.9.5080-5088.2004.
  • 35. Hazen TH, Leonard SR, Lampel KA, Lacher DW, Maurelli AT, et al. Investigating the relatedness of Enteroinvasive Escherichia coli to other E. coli and Shigella isolates by using comparative genomics. Infect Immun. 2016;84:2362–2371. doi: 10.1128/IAI.00350-16.
  • 36. Dhakal R, Wang Q, Howard P, Sintchenko V. Genome sequences of enteroinvasive Escherichia coli sequence type 6, 99, and 311 strains acquired in Asia Pacific. Microbiol Resour Announc. 2019;8:e00944-19. doi: 10.1128/MRA.00944-19.
  • 37. Patel IR, Gangiredla J, Mammel MK, Lampel KA, Elkins CA, et al. Draft genome sequences of the Escherichia coli Reference (ECOR) collection. Microbiol Resour Announc. 2018;7:e01133-18. doi: 10.1128/MRA.01133-18.
  • 38. Langendorf C, Le Hello S, Moumouni A, Gouali M, Mamaty A-A, et al. Enteric bacterial pathogens in children with diarrhea in Niger: diversity and antimicrobial resistance. PLoS One. 2015;10:e0120275. doi: 10.1371/journal.pone.0120275.
  • 39. Criscuolo A, Brisse S. AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics. 2013;102:500–506. doi: 10.1016/j.ygeno.2013.07.011.
  • 40. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19:455–477. doi: 10.1089/cmb.2012.0021.
  • 41. Njamkepo E, Fawal N, Tran-Dien A, Hawkey J, Strockbine N, et al. Global phylogeography and evolutionary history of Shigella dysenteriae type 1. Nat Microbiol. 2016;1:16027. doi: 10.1038/nmicrobiol.2016.27.
  • 42. Holt KE, Baker S, Weill F-X, Holmes EC, Kitchen A, et al. Shigella sonnei genome sequencing and phylogenetic analysis indicate recent global dissemination from Europe. Nat Genet. 2012;44:1056–1059. doi: 10.1038/ng.2369.
  • 43. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421.
  • 44. Machado J, Grimont F, Grimont PA. Identification of Escherichia coli flagellar types by restriction of the amplified fliC gene. Res Microbiol. 2000;151:535–546. doi: 10.1016/s0923-2508(00)00223-0.
  • 45. Touchon M, Rocha EPC. The small, slow and specialized CRISPR and anti-CRISPR of Escherichia and Salmonella. PLoS One. 2010;5:e11126. doi: 10.1371/journal.pone.0011126.
  • 46. Ishino Y, Shinagawa H, Makino K, Amemura M, Nakata A. Nucleotide sequence of the iap gene, responsible for alkaline phosphatase isozyme conversion in Escherichia coli, and identification of the gene product. J Bacteriol. 1987;169:5429–5433. doi: 10.1128/jb.169.12.5429-5433.1987.
  • 47. Rohde JR, Breitkreutz A, Chenal A, Sansonetti PJ, Parsot C. Type III secretion effectors of the IpaH family are E3 ubiquitin ligases. Cell Host Microbe. 2007;1:77–83. doi: 10.1016/j.chom.2007.02.002.
  • 48. Hartman AB, Venkatesan M, Oaks EV, Buysse JM. Sequence and molecular characterization of a multicopy invasion plasmid antigen gene, ipaH, of Shigella flexneri. J Bacteriol. 1990;172:1905–1915. doi: 10.1128/jb.172.4.1905-1915.1990.
  • 49. Venkatesan MM, Buysse JM, Kopecko DJ. Use of Shigella flexneri ipaC and ipaH gene sequences for the general identification of Shigella spp. and enteroinvasive Escherichia coli. J Clin Microbiol. 1989;27:2687–2691. doi: 10.1128/jcm.27.12.2687-2691.1989.
  • 50. Liu B, Knirel YA, Feng L, Perepelov AV, Senchenkova SN, et al. Structure and genetics of Shigella O antigens. FEMS Microbiol Rev. 2008;32:627–653. doi: 10.1111/j.1574-6976.2008.00114.x.
  • 51. Kim J, Lindsey RL, Garcia-Toledo L, Loparev VN, Rowe LA, et al. High-quality whole-genome sequences for 59 historical Shigella strains generated with PacBio sequencing. Genome Announc. 2018;6:e00282-18. doi: 10.1128/genomeA.00282-18.
  • 52. Casalino M, Latella MC, Prosseda G, Colonna B. CadC is the preferential target of a convergent evolution driving enteroinvasive Escherichia coli toward a lysine decarboxylase-defective phenotype. Infect Immun. 2003;71:5472–5479. doi: 10.1128/IAI.71.10.5472-5479.2003.
  • 53. Ventola E, Bogaerts B, De Keersmaecker SCJ, Vanneste K, Roosens NHC, et al. Shifting national surveillance of Shigella infections toward geno-serotyping by the development of a tailored Luminex assay and NGS workflow. MicrobiologyOpen. 2019;8:e00807. doi: 10.1002/mbo3.807.

Articles from Microbial Genomics are provided here courtesy of Microbiology Society
