Abstract
Compared to short-read sequencing data, long-read sequencing facilitates single contiguous de novo assemblies and characterization of the prophage region of the genome. Here, we describe our methodological approach to using Oxford Nanopore Technology (ONT) sequencing data to quantify genetic relatedness and to look for microevolutionary events in the core and accessory genomes to assess the within-outbreak variation of four genetically and epidemiologically linked isolates. Analysis of both Illumina and ONT sequencing data detected one SNP between the four sequences of the outbreak isolates. The variant calling procedure highlighted the importance of masking homologous sequences in the reference genome regardless of the sequencing technology used. Variant calling also highlighted the systemic errors in ONT base-calling and the ambiguous mapping of Illumina reads, both of which result in variations in the genetic distance when comparing one technology to the other. The prophage component of the outbreak strain was analysed, and nine of the 16 prophages showed some similarity to the prophage in the Sakai reference genome, including the stx2a-encoding phage. Prophage comparison between the outbreak isolates identified minor genome rearrangements in one of the isolates, including an inversion and a deletion event. The ability to characterize the accessory genome in this way is the first step to understanding the significance of these microevolutionary events and their impact on the evolutionary history, virulence and potentially the likely source and transmission of this zoonotic, foodborne pathogen.
Keywords: Bacteriophage, Escherichia coli O157:H7, Illumina, Nanopore, Shiga toxin, Whole Genome sequencing
Data Summary
All FASTQ files and assemblies were submitted to the National Centre for Biotechnology Information (NCBI). All data can be found under BioProject: PRJNA315192 - https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA315192. Strain-specific details can be found in Methods under data deposition.
Impact Statement
The use of short-read sequencing data for surveillance of gastrointestinal pathogens is well established, and the added value provided by this approach has been well documented. Here, we begin to explore how supplementing short-read sequencing data with long-read sequencing data (Oxford Nanopore Technology) can add value to public health surveillance of STEC, including outbreak detection and investigation. We describe our methodological approach to the analysis of the accessory genomes of four temporally related cluster isolates of STEC O157:H7. The comparison of the ONT sequencing data with the Illumina sequencing data confirmed the close genetic relatedness of the four outbreak isolates. Although between the outbreak strains the prophage content was stable, minor structural alterations were observed in two prophages in one of the isolates. Long-read sequencing data provides an opportunity to explore the accessory genome, and to better understand the significance of these microevolutionary events.
Introduction
Shiga toxin-producing Escherichia coli (STEC) O157:H7 is a human gastrointestinal pathogen that colonizes the gut of healthy ruminants, particularly cattle and sheep. Symptoms in humans range from mild diarrhoea to abdominal cramps, vomiting and severe bloody diarrhoea. In 5–15 % of cases, the infection can lead to the development of haemolytic uraemic syndrome (HUS), a severe multi-system syndrome [1] that can be fatal, particularly in young children and the elderly. STEC O157:H7 has a very low infectious dose (10–100 organisms) and transmission to humans occurs through consumption of contaminated food or water, direct or indirect contact with animals or their environment and through person-to-person spread [1].
In 2015, Public Health England (PHE) implemented high-throughput, real-time sequencing for the surveillance of gastrointestinal pathogens, including STEC O157:H7. The detection of SNPs by mapping short reads to a single reference genome is used to identify linked cases and outbreaks of infectious disease. High-quality SNPs are identified based on validated thresholds of mapping quality, mapping depth, and variant ratio. SNPs that do not meet these criteria, positions that have no aligned reads, or invariant positions with depth or mapping quality less than the specified thresholds are termed ‘ignored positions’. Consequently, a high proportion of repetitive and homologous features, including prophage, are masked from the analysis, and little is known about the variation in prophage content of STEC O157:H7 genomes.
STEC O157:H7 has a large accessory genome, with approximately 10–15 % of the genome comprised of prophage [2, 3]. Furthermore, the defining characteristic of the STEC group, the Shiga toxin genes (stx) are bacteriophage encoded [4]. Therefore, analysis of prophage content, loss and acquisition of bacteriophage and structural rearrangements within prophage regions contributes to our understanding of the evolutionary history, virulence and potentially the likely source and transmission of this zoonotic, foodborne pathogen. Long-read sequencing technologies, such as Oxford Nanopore Technology (ONT) have been shown to achieve improved de novo assemblies and facilitate more complete characterization of the accessory genome [5, 6] including prophage regions [6].
In August 2017, a cluster of four cases (A–D) infected with genetically related strains of STEC O157:H7 was identified by the national Gastrointestinal Infections Department at Public Health England (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/765498/STEC_O157_PT21.28_Outbreak_Report.pdf). All four isolates were identified as STEC O157:H7 phage type 21/28 harbouring stx subtypes stx2a and stx2c, and belonging to sub-lineage Ic [7]. Because the strains possessed the stx2a toxin subtype, which is known to be associated with more severe disease and HUS, a multi-agency investigation was undertaken despite the small number of cases. Handling raw pet food, specifically tripe (the edible lining of the stomach of cattle and sheep), was identified as the cause of the outbreak. The SNP-type profiles derived from the short-read Illumina sequencing data were identical for three cases, and one isolate (from case B) differed by one SNP from the other isolates (Fig. 1). We describe our methodological approach to the analysis of ONT sequencing data to further quantify genetic relatedness and to look for microevolutionary events in the core and accessory genomes to assess the within-outbreak variation of four genetically and epidemiologically linked isolates.
Methods
Short-read sequencing on the Illumina platform and core SNPs analysis
Genomic DNA was extracted from cultures of STEC O157:H7 using the Qiagen Qiasymphony (Qiagen, Hilden, Germany). The sequencing library was prepared using the Nextera XP kit (Illumina, San Diego, USA) for sequencing on the Illumina HiSeq 2500 (Illumina, San Diego, USA) instrument run with the fast protocol. High-quality trimmed Illumina reads (leading and trailing trimming at <Q30 using Trimmomatic v0.27 [8]; read length 80–100 bp) were mapped to the STEC O157:H7 reference genome Sakai (GenBank accession BA000007) using BWA-MEM v0.7.13 [9]. The Sakai STEC O157:H7 reference genome (BA000007) contains 18 prophages, of which two are Stx-encoding (stx1a and stx2a), and six prophage-like regions, including the locus of enterocyte effacement [10]. SNPs were identified using GATK v2.6 in unified genotyper mode [11]. Core-genome positions that had a high-quality SNP (>90 % consensus, minimum depth 10×, MQ >=30) in at least one isolate were extracted for further analysis. Genomes were compared to the sequences held in the PHE STEC O157:H7 WGS database (SnapperDB v0.2.5), and isolates with five SNP differences or fewer within their core genome were considered closely related and likely to have an epidemiological link [7, 12].
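The five-SNP relatedness threshold can be illustrated with a minimal sketch. This is not the SnapperDB implementation; the isolate names and base strings below are invented for illustration, and each string is assumed to hold the base called at the same ordered set of core-genome variant positions.

```python
from itertools import combinations

SNP_THRESHOLD = 5  # "five SNP differences or fewer", as stated in the text


def snp_distance(bases_a, bases_b):
    """Count positions at which two isolates carry different bases."""
    if len(bases_a) != len(bases_b):
        raise ValueError("profiles must cover the same positions")
    return sum(1 for a, b in zip(bases_a, bases_b) if a != b)


def closely_related_pairs(profiles, threshold=SNP_THRESHOLD):
    """Return (name_a, name_b, distance) for every pair within the threshold."""
    pairs = []
    for (name_a, bases_a), (name_b, bases_b) in combinations(sorted(profiles.items()), 2):
        distance = snp_distance(bases_a, bases_b)
        if distance <= threshold:
            pairs.append((name_a, name_b, distance))
    return pairs


# Hypothetical toy profiles (not data from this study).
toy_profiles = {
    "isolate_1": "ACGTACGT",
    "isolate_2": "ACGTACGA",  # one difference from isolate_1
    "isolate_3": "TGCATGCT",  # more than five differences from both others
}

print(closely_related_pairs(toy_profiles))
```

Only the first two toy isolates fall within the threshold, so only that pair would be flagged as likely to share an epidemiological link.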
Long read sequencing using ONT and data processing
Genomic DNA was extracted and purified using two methods. The first was the Promega Wizard Genomic DNA Purification Kit (Promega, Madison, USA) with minor alterations including doubled incubation times, no vigorous mixing steps (performed by inversion) and elution into 50 µl of double processed nuclease free water (Sigma-Aldrich, St. Louis, USA). The second method was the Revolugen Fire Monkey DNA extraction kit (Revolugen, Glossop, UK) to the manufacturer’s instructions. DNA was quantified using a Qubit and the HS (High sensitivity) dsDNA Assay Kit (Thermofisher Scientific, Waltham, USA) to the manufacturer’s instructions. Library preparation was performed using the Native Barcoding kit (SQK-LSK108 and EXP-NBD103) (Oxford Nanopore Technologies, Oxford, UK). The prepared library was loaded on a FLO-MIN106 R9.4.1 flow cell (Oxford Nanopore Technologies, Oxford, UK) and sequenced using the MinION for 48 h.
Data produced in raw FAST5 format were base-called and de-multiplexed using Guppy v3.2.6 (Oxford Nanopore Technologies) into FASTQ format and grouped by each sample’s respective barcode. Samples were re-demultiplexed using Deepbinner v0.2.0 [13]. Run metrics were generated using Nanoplot v1.8.1 [14]. The barcode and y-adapter from each sample’s reads were trimmed, and chimeric reads split, using Porechop v0.2.1 [15]. Finally, the trimmed reads were filtered using Filtlong v0.1.1 [16] with the following parameters: min_length=1000, keep_percent=90 and target_bases=550 Mbp, to generate approximately 100× coverage of the STEC genome with the longest and highest quality reads.
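The link between the target_bases setting and the stated coverage is simple arithmetic, sketched below. The function name is hypothetical; the genome sizes are the 5.3–6.0 Mbp range used elsewhere in the Methods and a round 5.5 Mbp value chosen here for illustration.

```python
TARGET_BASES = 550_000_000  # target_bases=550 Mbp, as given in the text


def expected_coverage(target_bases, genome_size):
    """Mean depth obtained if target_bases are spread evenly over the genome."""
    return target_bases / genome_size


# Illustrative genome sizes in bp (5.5 Mbp is an assumed round figure).
for genome_size in (5_300_000, 5_500_000, 6_000_000):
    depth = expected_coverage(TARGET_BASES, genome_size)
    print(f"{genome_size / 1e6:.1f} Mbp genome: about {depth:.0f}x coverage")
```

For a genome of about 5.5 Mbp the retained bases correspond to roughly 100× depth, which matches the figure quoted above.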
De novo assembly, polishing, reorientation and annotation
Trimmed and filtered ONT FASTQ files were assembled using Flye v2.6 [17]. For each sample, the assembly with the highest N50 and the lowest number of contigs, and with an assembly size of 5.3–6.0 Mbp, was taken forward. Polishing of the assemblies was performed in a three-step process. Firstly, Nanopolish v0.11.1 [18] was run using both the trimmed ONT FASTQs and FAST5s for each respective sample, accounting for methylation using --methylation-aware=dcm,dam, with --min-candidate-depth=10 and --min-candidate-frequency=0.1. Secondly, Pilon v1.22 [19] was run with --minmq=0, --minqual=0 and --mindepth=0.05 set; Illumina FASTQ reads were used as the query dataset with the use of BWA v0.7.17 [9] and Samtools v1.7 [20]. Finally, Racon v1.2.1 [21] (--error-threshold=0.3 and --quality-threshold=10), also using BWA v0.7.17 [9], was used with the Illumina reads to produce a final assembly for each sample. As all assemblies were circularized and closed, they were reoriented to start at the dnaA gene (NC_000913) from E. coli K12, using the --fixstart parameter in circlator v1.5.5 [22]. Prokka v1.13 [23] with the use of a personalized database (https://github.com/gingerdave269/prophage_DB) was used to annotate the final assemblies.
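The assembly selection criteria can be sketched as follows. The text does not state how N50 and contig count were weighed against each other, so this sketch assumes N50 is ranked first and contig count breaks ties; the candidate names and the fragmented and undersized contig sets are invented, while the first candidate reuses the case A chromosome and plasmid lengths reported in the Results.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs supplied")


def pick_assembly(candidates, min_size=5_300_000, max_size=6_000_000):
    """Keep assemblies in the size window, then prefer high N50 and few contigs."""
    in_range = [c for c in candidates if min_size <= sum(c["contigs"]) <= max_size]
    if not in_range:
        return None
    # Assumption: N50 first, contig count as tie-breaker.
    return max(in_range, key=lambda c: (n50(c["contigs"]), -len(c["contigs"])))


# Hypothetical candidate assemblies for one sample.
candidates = [
    {"name": "run1", "contigs": [5_486_665, 91_449]},
    {"name": "run2", "contigs": [3_000_000, 2_486_665, 91_449]},
    {"name": "run3", "contigs": [4_000_000]},  # outside the size window
]

best = pick_assembly(candidates)
print(best["name"], n50(best["contigs"]))
```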
Prophage detection, excision and processing
Prophages across all samples were detected using the Phage Search Tool (PHASTER) [24]. Prophage sequences were extracted from each sample’s chromosome regardless of prophage size or quality. Any detected prophages separated by less than 4 kbp were conjoined into a single phage using Propi v0.0.1 as described in Shaaban et al. [25]. Prophages were re-annotated using Prokka v1.13 [23]. Prophages were compared using Easyfig v2.2.5 [26].
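The rule for conjoining neighbouring prophages can be sketched as a simple interval merge. This is an illustration of the rule as stated, not Propi's own code, and the coordinates are invented.

```python
def merge_close_regions(regions, max_gap=4000):
    """Join (start, end) regions whose separation is less than max_gap bp."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            # Gap to the previous region is under the limit: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical prophage coordinates on one chromosome.
detected = [(1_000, 20_000), (22_000, 40_000), (50_000, 60_000)]
print(merge_close_regions(detected))
```

The first two regions are 2 kbp apart and are reported as one compound prophage, while the third, 10 kbp further on, stays separate.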
Mash and prophage comparison
Mash v2.2 [27] was used to sketch (sketch length 1000, k-mer length 21) all extracted prophages in the samples sequenced in this study and all prophages found in the Sakai STEC reference genome (BA000007). The pairwise Jaccard distance between the prophages was calculated, and a neighbour-joining tree was computed and visualized using FigTree v1.4.4.
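The idea behind a sketch-based Jaccard distance can be shown with a small bottom-s MinHash example using the same sketch length (1000) and k-mer length (21). This is a simplified stand-in rather than Mash itself: the hash function (BLAKE2b from the Python standard library) is an assumption made for the sketch, reverse-complement k-mers are ignored, and the sequences are toy strings.

```python
import hashlib


def kmers(seq, k=21):
    """Set of all k-length substrings of the sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=21, size=1000):
    """Keep the `size` smallest 64-bit hashes of the sequence's k-mers."""
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_distance(sketch_a, sketch_b, size=1000):
    """1 minus the Jaccard index estimated from two bottom-s sketches."""
    union = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not union:
        return 1.0
    shared = set(sketch_a) & set(sketch_b)
    return 1.0 - sum(1 for h in union if h in shared) / len(union)


# Toy sequences (not prophage data).
seq_a = "ACGTTGCAAGCTTAGGCTAACGTTAGCCTAGGATCCGATTACA"
seq_b = "A" * 50
seq_c = "C" * 50

print(jaccard_distance(sketch(seq_a), sketch(seq_a)))
print(jaccard_distance(sketch(seq_b), sketch(seq_c)))
```

Identical sequences give a distance of 0 and sequences with no shared k-mers give 1; a matrix of such pairwise distances is the kind of input a neighbour-joining method takes.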
Variant calling and phylogenetic tree construction
For reference-based variant calling, both Illumina and ONT FASTQ reads were mapped to the Sakai STEC O157 reference genome (BA000007) using BWA v0.7.3 and minimap2 v2.2, respectively [28]. VCFs were produced using GATK v2.6.5 UnifiedGenotyper [11]. Core-genome positions that had a high-quality SNP (>90 % consensus for Illumina, >80 % consensus for ONT; minimum depth 10×; MQ >=30) in at least one isolate were extracted for further analysis. Any variants called at positions that were within the known prophages in Sakai were masked from further analyses. 5-methylcytosine positions were identified using Nanopolish v0.11.1 [18] and methylated positions were then masked from the ONT VCFs as described in Greig et al. [29]. Masking the prophage regions and the methylated positions of the reference genome left 81.0 % of the core genome available for comparison in both the Illumina and Nanopore data for the four outbreak samples. The maximum-likelihood phylogenetic tree was constructed by RAxML v8.1.17 [30] using an alignment generated from SnapperDB [12] in which recombination had been accounted for by Gubbins v2.00 [31]. Visualization of the phylogenetic tree was performed using FigTree v1.4.4 (Fig. 1). To detect false positive/negative SNPs called from the Illumina reads, discrepant variant positions between Illumina and Nanopore relative to the reference genome were extracted. Those variants that were called in paralogous sequences and that also had a lower-than-average mapping quality were then masked in the alignment. To be included in the masking process, the falsely called variant had to be present in the alignments of all samples in the study.
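The filtering and masking rules in this section can be summarized in a short sketch. It does not parse real GATK output: each call is a hypothetical dictionary with invented field names, positions and values, and the masked interval is an arbitrary example rather than a real Sakai prophage coordinate.

```python
# Thresholds taken from the text; consensus cut-off differs by technology.
CONSENSUS_MIN = {"illumina": 0.90, "ont": 0.80}
MIN_DEPTH = 10
MIN_MQ = 30


def is_high_quality(call, technology):
    """Apply the consensus, depth and mapping-quality thresholds."""
    return (
        call["alt_fraction"] > CONSENSUS_MIN[technology]
        and call["depth"] >= MIN_DEPTH
        and call["mq"] >= MIN_MQ
    )


def in_masked_region(position, masked_regions):
    """True if the position lies inside any masked (start, end) interval."""
    return any(start <= position <= end for start, end in masked_regions)


def filter_calls(calls, technology, masked_regions):
    """Positions of high-quality calls that fall outside masked regions."""
    return [
        call["pos"]
        for call in calls
        if is_high_quality(call, technology)
        and not in_masked_region(call["pos"], masked_regions)
    ]


# Hypothetical calls and one hypothetical masked prophage interval.
calls = [
    {"pos": 1000, "alt_fraction": 0.98, "depth": 45, "mq": 60},
    {"pos": 2000, "alt_fraction": 0.85, "depth": 45, "mq": 60},
    {"pos": 3000, "alt_fraction": 0.99, "depth": 8, "mq": 60},
    {"pos": 5500, "alt_fraction": 0.99, "depth": 45, "mq": 60},
]
masked = [(5000, 6000)]

print(filter_calls(calls, "illumina", masked))
print(filter_calls(calls, "ont", masked))
```

The call at position 2000 passes only under the lower ONT consensus threshold, the call at 3000 fails on depth, and the call at 5500 is removed by masking regardless of its quality.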
Data deposition
All FASTQ files and assemblies were submitted to the National Centre for Biotechnology Information (NCBI). Illumina FASTQ accessions: case A: SRR6052868, case B: SRR7223105, case C: SRR6001344 and case D: SRR6052929. Nanopore FASTQ accessions: case A: SRR9987849, case B: SRR9987851, case C: SRR9987850 and case D: SRR7477813. Assembly accessions (chromosome and plasmid): case A: CP043011 and CP043012, case B: CP043015 and CP043016, case C: CP043019 and CP043020, case D: CP043025 and CP043023. All FASTQs can be found under BioProject: PRJNA315192.
Results and Discussion
Assemblies generated by long-read sequencing data and variant calling
The assemblies for each isolate were resolved into two contigs comprising the chromosome and pO157 plasmid. The chromosome length (plasmid sizes and type) was 5 486 665 bp (91 449 bp IncFIB), 5 487 004 bp (91 445 bp IncFIB and 57 390 bp IncI2), 5 486 935 bp (91 445 bp IncFIB and 17 157 bp unknown type) and 5 424 337 bp (91 443 bp IncFIB) for cases A to D, respectively. Sample B contained an extra 57 kbp IncI2 plasmid.
In total, variant calling using GATK and SnapperDB identified between zero and six SNPs in pairwise comparisons of the samples across both sequencing technologies (Table 1). In the Illumina-generated sequence data, only a single SNP was detected between the samples. This SNP was located in sample B at position 2 578 517 relative to the reference genome, within a gene that encodes a proton conductor (Table 1). In the ONT-generated sequence data, no SNPs were detected between any of the four samples relative to the reference genome (Table 2). There were five SNPs that differed between the ONT and Illumina datasets relative to the reference genome. Of the five discrepant SNPs, one was called as a variant in the Illumina data and four were called as variants in the ONT data (Table 2).
Table 1.
Sample | Case A ONT | Case B ONT | Case C ONT | Case D ONT | Case A Illumina | Case B Illumina | Case C Illumina | Case D Illumina
---|---|---|---|---|---|---|---|---
Case A ONT | / | 0 | 0 | 0 | 5 | 6 | 5 | 5
Case B ONT | 0 | / | 0 | 0 | 5 | 6 | 5 | 5
Case C ONT | 0 | 0 | / | 0 | 5 | 6 | 5 | 5
Case D ONT | 0 | 0 | 0 | / | 5 | 6 | 5 | 5
Case A Illumina | 5 | 5 | 5 | 5 | / | 1 | 0 | 0
Case B Illumina | 6 | 6 | 6 | 6 | 1 | / | 0 | 0
Case C Illumina | 5 | 5 | 5 | 5 | 0 | 0 | / | 0
Case D Illumina | 5 | 5 | 5 | 5 | 0 | 0 | 0 | /
Table 2.
Position in reference genome | Base in reference genome | Base in Illumina data | Base in nanopore data | Interpretation
---|---|---|---|---
270 595 | C | A | C | False positive by Illumina
379 516 | A | G | A | False negative by Nanopore
2 033 176 | T | G | T | False negative by Nanopore
4 709 195 | A | A | G | False positive by Nanopore
4 901 209 | A | A | G | False positive by Nanopore
All five of the discrepant SNPs between the two technologies were false positive or false negative calls. Ambiguous mapping of short-read Illumina sequences to paralogous sequences in the reference genome led to the introduction of a single false positive SNP. The remaining four false positive or false negative SNPs were generated by a known systemic error during the base-calling of homopolymer regions in ONT sequencing, which results in small single or double base insertions within the reads [29, 32]. Correction for these systemic errors and the ambiguous mapping error confirmed a single SNP in sample B in the Illumina dataset; all other samples (and technologies) had no SNP differences from each other.
The comparison of the ONT sequencing data with the Illumina sequencing data confirmed the close genetic relatedness of the four outbreak isolates, as only one additional SNP was identified between the outbreak strain genomes. This comparison highlighted the limitations associated with each technology, specifically the base-calling errors related to homopolymer detection observed in ONT data and the importance of masking of homologous and paralogous regions in the Illumina data.
Analysis of prophage content of the outbreak isolates and comparison with the Sakai reference genome
Of 16 prophage regions, 15 were shared between the four outbreak isolates (Tables 3 and 4, Fig. 2), with sample D containing an extra prophage (Table 3). Prophage size ranged from 8 to 145 kbp. Seven of the 16 prophages showed similarity to prophages in the Sakai reference genome; all seven prophages had 98–100 % nucleotide identity and coverage (Table 3 and Fig. 2). Prophages 3, 4, 5, 7, 10, 14 and 15 in the outbreak isolates matched >98 % similarity and coverage to Sakai prophages (Sp3, Sp4, Sp6, Sp8, Sp14, Sp16 and Sp17, respectively), and shared the same bacteriophage insertion (SBI) site (ybhC, yccA, potC, icd, serU, argW and ssrA, respectively). Case D had an extra prophage compared to the other three samples designated prophage A (Figs 2 and 3, Table 3). This prophage was 28 kbp in length and integrated at the hipA gene.
Table 3.
Prophage detected | Gene 5′ to prophage | Gene 3′ to prophage | Size (bp) in sample A | Size (bp) in sample B | Size (bp) in sample C | Size (bp) in sample D | Match to Sakai prophages (% similarity/% coverage) | Position in sample A | Position in sample B | Position in sample C | Position in sample D
---|---|---|---|---|---|---|---|---|---|---|---
1 | lexA | aphA | 22 035 | 22 035 | 22 034 | 22 036 | / | 403 708–425 743 | 403 729–425 764 | 403 712–425 746 | 403 708–425 744
2∗ | tRNA-Thr (cgt)† | prgR | 26 826 | 26 823 | 26 823 | 26 821 | / | 1 101 413–1 128 239 | 1 101 440–1 128 263 | 1 101 421–1 128 244 | 1 101 414–1 128 235
3 | ybhC† | ybhB | 38 095 | 38 095 | 38 096 | 38 095 | Sp3 (99 %/100 %) | 1 697 277–1 735 372 | 1 697 420–1 735 515 | 1 697 397–1 735 493 | 1 697 388–1 735 483
4 | yccA† | tRNA-Ser (tga) | 48 257 | 48 277 | 48 265 | 48 263 | Sp4 (99 %/100 %) | 1 967 370–2 015 627 | 1 967 513–2 015 790 | 1 967 491–2 015 756 | 1 967 481–2 015 744
5 | potC† | potB | 47 889 | 47 679 | 47 879 | 47 896 | Sp6 (99 %/100 %) | 2 283 891–2 331 780 | 2 284 054–2 331 733 | 2 284 019–2 331 898 | 2 284 012–2 331 908
6 | roxA† | phoQ | 10 436 | 10 436 | 10 436 | 10 436 | / | 2 337 022–2 347 458 | 2 336 975–2 347 411 | 2 337 140–2 347 576 | 2 337 150–2 347 586
7 | icd† | caeB | 40 992 | 40 991 | 40 992 | 40 991 | Sp8 (99 %/100 %) | 2 356 230–2 397 222 | 2 356 183–2 397 174 | 2 356 348–2 397 340 | 2 356 358–2 397 349
8∗ | ompW† | rspR | 145 151 | 144 981 | 145 040 | 53 636 | / | 2 495 566–2 640 717 | 2 495 518–2 640 499 | 2 495 685–2 640 725 | 2 495 693–2 549 329
A | hipA | ydeP | / | / | / | 28 513 | / | / | / | / | 2 589 222–2 617 735
9 | trpA | rspA | 59 339 | 59 340 | 59 322 | 59 863 | / | 2 918 047–2 977 386 | 2 917 834–2 977 174 | 2 918 061–2 977 383 | 2 855 169–2 915 032
10 | yodB | tRNA-Ser (cga)† | 43 681 | 43 694 | 43 684 | 43 690 | Sp14 (99 %/98 %) | 3 370 042–3 413 723 | 3 369 847–3 413 541 | 3 370 056–3 413 740 | 3 307 170–3 350 860
11 (stx2c) | yeeA | tnpA† | 55 650 | 56 910 | 55 652 | 55 655 | / | 3 457 661–3 513 311 | 3 456 222–3 513 132 | 3 457 677–3 513 329 | 3 394 797–3 450 452
12 | yehV | yehV† | 45 720 | 45 725 | 45 732 | 45 719 | / | 3 654 766–3 700 486 | 3 654 584–3 700 309 | 3 654 782–3 700 514 | 3 591 904–3 637 623
13 (stx2a) | yfdC | argW† | 60 239 | 60 237 | 60 239 | 60 242 | / | 3 948 563–4 008 802 | 3 948 386–4 008 623 | 3 948 591–4 008 830 | 3 885 700–3 945 942
14 | argW† | lacY | 8233 | 8233 | 8233 | 8233 | Sp16 (100 %/100 %) | 4 009 234–4 017 467 | 4 009 055–4 017 288 | 4 009 262–4 017 495 | 3 946 374–3 954 607
15 | ssrA | alpA† | 22 107 | 22 103 | 22 098 | 22 106 | Sp17 (99 %/99 %) | 4 294 770–4 316 877 | 4 294 584–4 316 687 | 4 294 801–4 316 899 | 4 231 915–4 254 021
∗Refers to prophages that appear to be compound prophages (i.e. two or more prophages that are sequential, with intact integrase genes).
†Refers to the end in which the Integrase gene (IntA) is located.
Table 4.
Prophage-like element | Gene 5′ | Gene 3′ | Size (bp) in sample A | Size (bp) in sample B | Size (bp) in sample C | Size (bp) in sample D | Match to Sakai SpLEs (% similarity/% coverage) | Position in sample A | Position in sample B | Position in sample C | Position in sample D
---|---|---|---|---|---|---|---|---|---|---|---
PLE1 | tRNA-Leu | int | 9530 | 9530 | 9530 | 9530 | SpLE5 (99 %/97 %) | 650 711–660 241 | 650 730–660 260 | 650 601–660 131 | 650 715–660 245
PLE2 | int | nanS | 34 465 | 34 981 | 34 465 | 34 465 | SpLE6 (99 %/100 %) | 660 260–694 725 | 659 763–694 744 | 660 150–694 615 | 660 264–694 729
PLE3 | ycdU | tRNA-Ser | 85 906 | 85 906 | 85 906 | 85 911 | SpLE1 (99 %/100 %) | 2 112 819–2 198 725 | 2 113 064–2 198 970 | 2 112 913–2 198 819 | 2 113 028–2 198 939
PLE4 | cobU | yeeB | 15 044 | 15 045 | 15 044 | 15 046 | SpLE2 (99 %/100 %) | 3 439 339–3 454 383 | 3 439 883–3 454 928 | 3 439 719–3 454 763 | 3 377 102–3 392 148
PLE5 | tRNA-Phe | pitB | 23 334 | 23 333 | 23 333 | 23 332 | SpLE3 (99 %/100 %) | 4 664 210–4 687 544 | 4 664 345–4 687 678 | 4 664 588–4 687 921 | 4 601 991–4 625 323
PLE6 | selC | yicL | 50 390 | 50 389 | 50 390 | 50 390 | SpLE4 (99 %/79 %) | 5 392 598–5 442 988 | 5 392 972–5 443 361 | 5 393 215–5 443 605 | 5 330 617–5 381 007
There were nine prophages in the outbreak isolates and 11 in Sakai that were <50 % homologous or do not match at all. Five prophages in the outbreak isolates share the same SBI site with prophages in the Sakai reference genome; prophage 6 shares phoQ with Sp6, prophage 12 shares yehV with Sp15. There appears to be a homologous recombination event in between prophages 8 and 9 relative to Sp11 and 12. Prophage 2 labelled as a compound prophage shares thrW where Sp1 and Sp2 are located. The sites of prophages 1, 11 and 13 located at lexA, tnpA and argW, respectively, in the outbreak strain are vacant in the Sakai reference genome whereas the sites for Sp5, Sp9, Sp10, Sp13 and Sp18 located at wrbA, yciD, ydaO, leuZ and a sorbitol operon, respectively in Sakai are vacant in the outbreak samples.
In the outbreak strain, stx2c and stx2a were encoded on prophages 11 and 13, inserted at tnpA and argW, respectively (Table 3). For the stx2c-encoding prophage, the known SBI site, sbcB, as previously described for a PT21/28 STEC [15, 29, 33], has been split by a short 2.7 kbp insertion sequence (IS629); hence the designation tnpA, which encodes a transposase.
The stx2a-encoding prophage detected in the strain described in this study had ~30 % coverage but greater than 97 % nucleotide similarity with Sp5, which is the Sakai stx2a-encoding prophage. The regions of high similarity included the stx-encoding genes, the Q region, the nin region, and the DNA replication, origin and general recombination regions. The prophage structural regions differed, including the head, tail and tail fibre/tip regions. The stx2c-encoding prophage was not present in the Sakai reference strain and so no comparison was possible. Unlike Sakai, the samples sequenced in this study did not contain a stx1a-encoding prophage; however, Sp15, which is a stx1a-encoding prophage, was structurally similar to prophage 12 and shared the same SBI site, yehV (Table 3, Fig. 3).
The strain of STEC O157:H7 linked to the tripe outbreak sequenced in this study and the Sakai reference strain [34] belong to two different sub-lineages, Ic and Ia, respectively, and were isolated in geographically distinct regions, 20 years apart. The prophage commonality shared between the two strains indicates some stability of the non-stx-encoding prophage content over time and space (Fig. 3). In contrast, the dynamic nature of the stx-encoding phage is well documented, and variation of stx profiles has been described both in globally distributed strains belonging to the same lineage and in closely related strains at the local level [7, 34, 35]. Previous studies charting the evolutionary history of STEC O157:H7 propose that the lineage I progenitor strain had stx2c only [7]. At some point during its evolutionary history, the Sakai outbreak strain appears to have lost the stx2c-encoding prophage and acquired a stx1a-encoding prophage (which is similar to the stx-negative prophage 12 in this study) and a stx2a-encoding prophage, although the order of these events is unclear. The acquisition of stx2a-encoding prophages by sub-lineage Ic in the UK approximately 25–30 years ago is well described and resulted in the change in PT from PT32 to PT21/28 [7, 25, 34]. The stx2a-encoding prophage (prophage 13) and Sp5 (Sakai’s stx2a-encoding prophage) share only 40 % of hashes via Mash and have different SBI sites (argW and wrbA, respectively).
Within-outbreak comparison of the prophage regions of the four outbreak isolates
The chromosomes of the outbreak isolates were aligned. Genome rearrangements were identified within prophage 2, where cases A and B differ from C and D (Fig. 4), and in prophage 8 in the sequence linked to case D with respect to the other outbreak sequences (Fig. 5). In prophage 2, a 1739 bp inversion was identified involving two prophage tail genes surrounding a hypothetical gene (yfdK) (Fig. 4). In prophage 8, a deletion event was observed (Fig. 5). Prophage 8 was a large compound prophage (53 kbp in the sequence linked to case D and 145 kbp in the remaining samples) containing at least three separate prophages positioned sequentially without any chromosomal sequence separating them. In the sequence linked to case D, there appears to be a 92 kbp deletion (relative to the three other samples). The 92 kbp deletion contained almost two full prophage sequences, making up two sets of structural bacteriophage genes, regulatory genes, lysis genes and site-to-site recombination genes.
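As a quick check on the size of the deletion, the prophage 8 coordinates listed in Table 3 can be converted to lengths and compared. The short sketch below only restates the arithmetic; the variable names are illustrative.

```python
# Prophage 8 start and end coordinates per sample, taken from Table 3.
prophage8_coords = {
    "A": (2_495_566, 2_640_717),
    "B": (2_495_518, 2_640_499),
    "C": (2_495_685, 2_640_725),
    "D": (2_495_693, 2_549_329),
}

# Length of prophage 8 in each sample (end minus start, as in Table 3).
sizes = {sample: end - start for sample, (start, end) in prophage8_coords.items()}

# Sequence missing from case D relative to each of the other samples.
deletion = {sample: sizes[sample] - sizes["D"] for sample in ("A", "B", "C")}

for sample, missing in deletion.items():
    print(f"Case {sample} vs case D: {missing} bp ({missing / 1000:.1f} kbp)")
```

The exact differences are about 91.3 to 91.5 kbp, in line with the roughly 92 kbp obtained from the rounded sizes (145 kbp minus 53 kbp).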
Within the four outbreak strains, the prophage content was equivalent except for three differences: the deleted prophage sequence in prophage 8 in case D, the acquisition of another prophage in case D, and a recombination event in prophage 2 in cases A and B relative to C and D. Without a better understanding of the expected variation within prophage in STEC O157:H7 in the source population, specifically in this case the bovine gastrointestinal tract, it is difficult to be certain whether these microevolutionary events represent meaningful differences between isolates. Once colonized, cattle may shed STEC O157:H7 for many months [36], and genetic changes, including the horizontal exchange of genetic information and genomic recombination/rearrangements, will occur in the bacterial genomes over that time [10, 25]. Although currently little is known about the selection pressures and population dynamics of STEC O157:H7 in the bovine reservoir, microevolutionary events such as these are unlikely to reflect a different source.
Summary
The advantages of using short-read WGS technologies for routine surveillance and the detection and risk management of outbreaks of STEC O157:H7 are well established [36–38]. For example, it is unlikely that this small, nationally dispersed cluster of cases of PT21/28, a commonly reported PT in the UK, would have been investigated prior to the implementation of WGS. However, due to the high prophage content in STEC O157:H7, assembling the genome into one contig is challenging and the utility of accessing information from the STEC accessory genome during an outbreak investigation is yet to be fully explored. In this study, we describe our methodological approach to the comparison of the accessory genomes of four temporally related cluster isolates of STEC O157:H7 epidemiologically linked to exposure to raw tripe. Comparison of Illumina with ONT sequencing data highlighted the limitations of SNP detection associated with both technologies; however, the analysis of the ONT data confirmed the close genetic relatedness demonstrated by the Illumina data. Although the within-outbreak prophage content was stable, minor structural alterations were observed in two prophages in one of the isolates. The ability to characterize the accessory genome in this way is the first step to understanding the significance of these microevolutionary events and their impact on relatedness [33, 39], the evolutionary history, virulence, and potentially the likely source and transmission [40] of this zoonotic, foodborne pathogen.
Funding information
The research was part funded by the National Institute for Health Research Health Protection Research Unit in Gastrointestinal Infections at University of Liverpool in partnership with Public Health England (PHE), in collaboration with University of East Anglia, University of Oxford and the Quadram Institute. Claire Jenkins, David Greig and Timothy Dallman are based at Public Health England. The views expressed are those of the authors and not necessarily those of the National Health Service, the NIHR, the Department of Health or Public Health England.
Acknowledgements
We would like to thank Professor David Gally at the Roslin Institute, University of Edinburgh, for his critical review of the early stages of this project.
Author contributions
T.J.D. and C. J. conceptualized the project. D.R.G. performed DNA extractions, library preparations, sequencing of isolates, data processing, genome assembly, genome polishing, genome annotation and created the Easyfig diagrams. T.J.D. performed prophage comparison using Mash and wrote associated scripts. D.R.G. performed the SNP comparison using SnapperDB and made the phylogenetic tree. D.R.G., T.J.D. and C.J., wrote the original manuscript. D.R.G., T.J.D., C.J. and S.E.G. reviewed and edited the manuscript. T.J.D. and C.J. supervised D.R.G.
Conflicts of interest
The authors declare that there are no conflicts of interest.
Footnotes
Abbreviations: GATK, genome analysis toolkit; HUS, haemolytic uraemic syndrome; LEE, locus of enterocyte effacement; MQ, mapping quality; NCBI, National Centre for Biotechnology Information; ONT, Oxford Nanopore Technology; PHE, Public Health England; PLE, prophage-like element; PT, phage type; SBI, Shiga toxin bacteriophage insertion; STEC, Shiga toxin-producing Escherichia coli; WGS, whole genome sequencing.
All supporting data, code and protocols have been provided within the article or through supplementary data files.
References
- 1. Launders N, Byrne L, Jenkins C, Harker K, Charlett A, et al. Disease severity of Shiga toxin-producing E. coli O157 and factors influencing the development of typical haemolytic uraemic syndrome: a retrospective cohort study, 2009-2012. BMJ Open. 2016;6:e009933. doi: 10.1136/bmjopen-2015-009933.
- 2. Eppinger M, Mammel MK, Leclerc JE, Ravel J, Cebula TA. Genomic anatomy of Escherichia coli O157:H7 outbreaks. Proc Natl Acad Sci U S A. 2011;108:20142–20147. doi: 10.1073/pnas.1107176108.
- 3. Ogura Y, Mondal SI, Islam MR, Mako T, Arisawa K, et al. The Shiga toxin 2 production level in enterohemorrhagic Escherichia coli O157:H7 is correlated with the subtypes of toxin-encoding phage. Sci Rep. 2015;5:16663. doi: 10.1038/srep16663.
- 4. Byrne L, Adams N, Jenkins C. Association between Shiga Toxin-Producing Escherichia coli O157:H7 stx gene subtype and disease severity, England, 2009-2019. Emerg Infect Dis. 2020;26:2394–2400. doi: 10.3201/eid2610.200319.
- 5. Latif H, Li HJ, Charusanti P, Palsson Bernhard Ø, Aziz RK. A Gapless, Unambiguous genome sequence of the enterohemorrhagic Escherichia coli O157:H7 Strain EDL933. Genome Announc. 2014;2:pii: e00821–14. doi: 10.1128/genomeA.00821-14.
- 6. Asadulghani M, Ogura Y, Ooka T, Itoh T, Sawaguchi A, et al. The defective prophage pool of Escherichia coli O157: prophage-prophage interactions potentiate horizontal transfer of virulence determinants. PLoS Pathog. 2009;5:e1000408. doi: 10.1371/journal.ppat.1000408.
- 7. Dallman TJ, Ashton PM, Byrne L, Perry NT, Petrovska L, et al. Applying phylogenomics to understand the emergence of Shiga-toxin-producing Escherichia coli O157:H7 strains causing severe human disease in the UK. Microb Genom. 2015;1:e000029. doi: 10.1099/mgen.0.000029.
- 8. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170.
- 9. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–595. doi: 10.1093/bioinformatics/btp698.
- 10.Hayashi T, Makino K, Ohnishi M, Kurokawa K, Ishii K, et al. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. 2001;8:11–22. doi: 10.1093/dnares/8.1.11. [DOI] [PubMed] [Google Scholar]
- 11.McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Dallman T, Ashton P, Schafer U, Jironkin A, Painset A, et al. SnapperDB: a database solution for routine sequencing analysis of bacterial isolates. Bioinformatics. 2018;34:3028–3029. doi: 10.1093/bioinformatics/bty212. [DOI] [PubMed] [Google Scholar]
- 13.Wick RR, Judd LM, Holt KE. Deepbinner: demultiplexing barcoded Oxford nanopore reads with deep convolutional neural networks. PLoS Comput Biol. 2018;14:e1006583. doi: 10.1371/journal.pcbi.1006583. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018;34:2666–2669. doi: 10.1093/bioinformatics/bty149. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Wick RR Porechop. https://github.com/rrwick/Porechop
- 16.Wick RR Filtlong. https://github.com/rrwick/Filtlong
- 17.Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37:540–546. doi: 10.1038/s41587-019-0072-8. [DOI] [PubMed] [Google Scholar]
- 18.Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods. 2015;12:733–735. doi: 10.1038/nmeth.3444. [DOI] [PubMed] [Google Scholar]
- 19.Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9:e112963. doi: 10.1371/journal.pone.0112963. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.H L, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. 1000 genome project data processing subgroup. The sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017;27:737–746. doi: 10.1101/gr.214270.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Hunt M, Silva ND, Otto TD, Parkhill J, Keane JA, et al. Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biol. 2015;16:294. doi: 10.1186/s13059-015-0849-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–2069. doi: 10.1093/bioinformatics/btu153. [DOI] [PubMed] [Google Scholar]
- 24.Arndt D, Grant JR, Marcu A, Sajed T, Pon A, et al. PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res. 2016;44:W16–W21. doi: 10.1093/nar/gkw387. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Shaaban S, Cowley LA, McAteer SP, Jenkins C, Dallman TJ, et al. Evolution of a zoonotic pathogen: investigating prophage diversity in enterohaemorrhagic Escherichia coli O157 by long-read sequencing. Microb Genom. 2016;2:e000096. doi: 10.1099/mgen.0.000096. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Sullivan MJ, Petty NK, Beatson SA. Easyfig: a genome comparison visualizer. Bioinformatics. 2011;27:1009–1010. doi: 10.1093/bioinformatics/btr039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:132. doi: 10.1186/s13059-016-0997-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Greig DR, Jenkins C, Gharbia S, Dallman TJ. Comparison of single-nucleotide variants identified by illumina and Oxford nanopore technologies in the context of a potential outbreak of Shiga toxin-producing Escherichia coli . Gigascience. 2019;8 doi: 10.1093/gigascience/giz104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. doi: 10.1093/bioinformatics/btu033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA, et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 2015;43:e15. doi: 10.1093/nar/gku1196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford nanopore sequencing. Genome Biol. 2019;20:129. doi: 10.1186/s13059-019-1727-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Greig DR, Jenkins C, Dallman TJ. A Shiga Toxin-Encoding prophage recombination event confounds the phylogenetic relationship between two isolates of Escherichia coli O157:H7 from the same patient. Front Microbiol. 2020;11:588769. doi: 10.3389/fmicb.2020.588769. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Makino K, Ishii K, Yasunaga T, Hattori M, Yokoyama K, et al. Complete nucleotide sequences of 93-kb and 3.3-kb plasmids of an enterohemorrhagic Escherichia coli O157:H7 derived from Sakai outbreak. DNA Res. 1998;5:1–9. doi: 10.1093/dnares/5.1.1. [DOI] [PubMed] [Google Scholar]
- 35.Byrne L, Dallman TJ, Adams N, Mikhail AFW, McCarthy N, et al. Highly Pathogenic Clone of Shiga Toxin-Producing Escherichia coli O157:H7, England and Wales. Emerg Infect Dis. 2018;24:2303–2308. doi: 10.3201/eid2412.180409. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Jenkins C, Dallman TJ, Grant KA. Impact of whole genome sequencing on the investigation of food-borne outbreaks of Shiga toxin-producing Escherichia coli serogroup O157:H7, England, 2013 to 2017. Euro Surveill. 2019;24 doi: 10.2807/1560-7917.ES.2019.24.4.1800346. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Allard MW, Stevens EL, Brown EW. All for one and one for all: the true potential of whole-genome sequencing. Lancet Infect Dis. 2019;19:683–684. doi: 10.1016/S1473-3099(19)30172-0. [DOI] [PubMed] [Google Scholar]
- 38.Herbert LJ, Vali L, Hoyle DV, Innocent G, McKendrick IJ, et al. E. coli O157 on Scottish cattle farms: evidence of local spread and persistence using repeat cross-sectional data. BMC Vet Res. 2014;10:95. doi: 10.1186/1746-6148-10-95. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Cowley LA, Dallman TJ, Fitzgerald S, Irvine N, Rooney PJ, et al. Short-term evolution of Shiga toxin-producing Escherichia coli O157:H7 between two food-borne outbreaks. Microb Genom. 2016;2:e000084. doi: 10.1099/mgen.0.000084. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Greig DR, Mikhail AFW, Dallman TJ, Jenkins C. Analysis Shiga Toxin-Encoding Bacteriophage in Shiga Toxin-Producing Escherichia coli O157:H7 stx2a/stx2c . Front Microbiol. 2020;11:577658. doi: 10.3389/fmicb.2020.577658. [DOI] [PMC free article] [PubMed] [Google Scholar]