J Clin Microbiol. 2019 Jun 25;57(7):e00307-19. doi: 10.1128/JCM.00307-19

Whole-Genome Single-Nucleotide Polymorphism (SNP) Analysis Applied Directly to Stool for Genotyping Shiga Toxin-Producing Escherichia coli: an Advanced Molecular Detection Method for Foodborne Disease Surveillance and Outbreak Tracking

Navjot Singh a,#, Pascal Lapierre a,#, Tammy M Quinlan a, Tanya A Halse a, Samantha Wirth a, Michelle C Dickinson a, Erica Lasek-Nesselquist a, Kimberlee A Musser a
Editor: Alexander Mellmann b
PMCID: PMC6595464  PMID: 31068414

KEYWORDS: SNP analysis, STEC, whole genome, outbreak, stool, surveillance studies

ABSTRACT

Whole-genome sequencing (WGS) of pathogens from pure culture provides unparalleled accuracy and comprehensive results at a cost that is advantageous compared with traditional diagnostic methods. Sequencing pathogens directly from a primary clinical specimen would help circumvent the need for culture and, in the process, substantially shorten the time to diagnosis and public health reporting. Unfortunately, this approach poses significant challenges because of the mixture of multiple sequences from a complex fecal biomass. The aim of this project was to develop a proof-of-concept protocol for the sequencing and genotyping of Shiga toxin-producing Escherichia coli (STEC) directly from stool specimens. We have developed an enrichment protocol that reliably achieves a substantially higher yield of E. coli DNA, which provides adequate next-generation sequencing (NGS) data for downstream bioinformatics analysis. A custom bioinformatics pipeline was created to remove non-E. coli reads, assess the STEC versus commensal E. coli population in the samples, and build consensus sequences based on population allele frequency distributions. Side-by-side analysis of WGS from paired STEC isolates and matched primary stool specimens reveals that this method can reliably be implemented for many clinical specimens to directly genotype STEC and accurately identify clusters of disease outbreak when no STEC isolate is available for testing.

INTRODUCTION

Shiga toxin-producing Escherichia coli (STEC) is estimated to cause nearly 100,000 illnesses each year in the United States (1). The most common pathogenic serotype E. coli O157:H7 is responsible for more than 70% of these infections annually (2, 3) and can be identified by clinical laboratories (4). Unlike most other E. coli serotypes, E. coli O157:H7 is unable to ferment sorbitol and is easily isolated, as it appears as white colonies on sorbitol-containing plates, such as sorbitol MacConkey (SMAC) agar. However, there are more than 100 serogroups of STEC that are capable of causing illness in humans that cannot be differentiated from normal enteric E. coli flora and are typically not detected by clinical testing (5). Starting in 1996, source tracking to identify and stop foodborne disease has been performed as part of the PulseNet program instituted and funded by the Centers for Disease Control and Prevention (CDC). Since 2013, whole-genome sequencing (WGS) methods have been developed and evaluated, and a transition to these newer WGS methods is under way.

In 2009, the CDC published recommendations regarding STEC testing in the clinical laboratory. These recommendations include culturing for E. coli O157 on selective and nonselective media, as well as testing for non-O157 STEC via a Shiga toxin enzyme immunoassay (EIA) or a molecular assay that detects the Shiga toxin genes (stx1 and stx2) (6). Additionally, it was recommended that positive STEC samples be forwarded to the local or state health department for verification and to obtain a pure culture isolate for characterization, serogrouping, and strain typing. Numerous conventional and real-time PCR molecular assays have been developed to detect stx genes, and in the past few years FDA-cleared tests have also become available (7–9).

Recent CDC reports examined the use of culture-independent diagnostic tests (CIDTs), an increasingly common method for diagnosing foodborne bacterial infections. This testing does not require isolation and identification of living organisms. Consequently, these tests can be conducted more rapidly and yield results far sooner than can be reached through traditional culturing methods. However, CIDTs do not provide the information needed to characterize the organisms that cause infections. Currently, the bacteria isolated from culture are still needed to identify antibiotic resistance, investigate outbreaks, and monitor foodborne disease trends (10).

The prospect of CIDT methods becoming the main diagnostic test employed by clinical microbiology laboratories poses a threat to public health investigations of foodborne disease because molecular subtyping information is lost. In 2003, Wadsworth Center, New York State Department of Health (WC), began requesting that clinical laboratories submit stool specimens and EIA-positive broths to increase the number of non-O157 STECs available for PulseNet fingerprinting, to support investigations, and to identify sources of STEC foodborne disease.

As an alternative to culture-based isolation followed by fingerprinting, we investigated a WGS method for the direct typing of STEC from stool specimens during outbreak investigations or when culture is not successful. However, due to the small amount of pathogen DNA in stool, WGS often provides insufficient sequencing data to construct single nucleotide polymorphism (SNP) alignments and accurately identify any potential links with other related strains. In order to improve STEC genome coverage, we adapted an enrichment method that captures the pathogen genome using whole-genome RNA baits (11). We generated RNA baits from an STEC isolate to capture STEC directly from stool DNA followed by WGS. These baits showed significant improvement in the yield of E. coli reads in primary stool samples tested and provided a cost-effective option. Since commensal E. coli in the gut has high homology to the STEC genome, we also developed an in-house bioinformatics pipeline to separate STEC reads from the contaminating commensal E. coli reads. We were able to obtain sufficient coverage of the STEC genome in 70% (7 out of 10) of our samples to accurately place them on a phylogeny with representative strains. Our postcapture data indicate that this approach can be used directly on stool DNA without the need for a pure STEC isolate, which can reduce the cost and turnaround time in outbreak tracking.

MATERIALS AND METHODS

Universal quantitative real-time PCR assay.

To estimate the proportion of STEC relative to total E. coli in stool samples, a quantitative real-time PCR assay targeting the fumC gene was designed using Primer Express (formerly Applied Biosystems). The commensal Escherichia coli IAI1 (GenBank accession number NC_011741) served as a reference for primer design. The final primer and probe set (forward primer, 5′-CGCGTCGCGTAGCAGAT-3′; reverse primer, 5′-TTTGTTCGGCGCGGTAA-3′; and fluorescently labeled probe 5′-/5Cy5/CTGGCAGTC/TAO/ATTACCTGTGCACCGTTT/3IAbRQSp/-3′) showed high sequence similarity among E. coli and Shigella spp., but not other enteric microorganisms. PCR amplification using PerfeCTa multiplex qPCR ToughMix (Quantabio, Beverly, MA) was performed under the following conditions: 95°C for 3 min, followed by 40 cycles of 95°C for 15 s and 60°C for 45 s. After this fumC assay was developed, it was combined with a previously developed assay for the detection of stx1 and stx2 genes from stool DNA (12). The CT values of the stx1/stx2 and fumC assays were compared, and the difference (stx − fumC) was used to estimate the relative quantity of STEC to all E. coli in the primary stool sample. Cell lysates from Campylobacter coli (97-2296), Yersinia enterocolitica (ATCC 9610), Salmonella enterica subsp. enterica (ATCC 13076), Shigella flexneri (ATCC 12022), Escherichia coli O103:H8 (ATCC 23982), Escherichia coli O111:H8 (ATCC BAA-181), and Escherichia coli O6 (ATCC 25922) were used for initial specificity testing.

Bacterial strains and DNA isolation.

DNA was isolated from 10 previously detected Shiga toxin-positive stools and from a culture of the Escherichia coli O111:H8 (ATCC BAA-181) strain using the stool mini kit (Qiagen, Hilden, Germany) on a QIAcube automated DNA extractor, following the Qiagen stool pathogen detection protocol with a starting volume of 200 μl of stool and an elution volume of 200 μl. These stools were chosen for testing the enrichment process by whole-genome baiting because STEC colonies had also been successfully isolated from them and sequenced by WGS.

Whole-genome RNA baiting.

A flowchart of the whole-genome RNA baiting of stool samples is shown in Fig. 1. Five hundred nanograms of DNA from Escherichia coli O111:H8 BAA-181 was used to generate whole-genome RNA baits as described in A. Melnikov et al. (11), with some modifications. DNA was sonicated to about 300 bp using a Covaris M220 instrument at settings of peak power of 50 W, duty factor of 20%, and cycles/burst of 200 in a 130-μl snap-cap microtube. Sonicated DNA was end repaired using the Quick blunting kit (New England BioLabs [NEB], Ipswich, MA) for 1 h at room temperature. The blunting reaction and all the subsequent reactions were purified by AmpureXP beads (Beckman Coulter, Brea, CA). The DNA was A-tailed using the Klenow exoenzyme (NEB) at 37°C for 1 h and purified. The double-stranded adaptor was generated by annealing an equimolar concentration of both oligonucleotides on a thermocycler by heating the sample at 95°C for 5 min. The thermocycler was turned off immediately after the heating step, and the adaptor was cooled to room temperature for 1 h. The aliquots of the adaptor were stored at −20°C for future use. The ligation reaction was performed with the A-tailed DNA and adaptor using quick ligase (NEB) at room temperature on a thermomixer (Eppendorf, Hamburg, Germany) at 1,000 rpm for 3 h. The ligation reaction was purified and amplified using Kapa HiFi HotStart ReadyMix (Kapa Biosystems, Wilmington, MA). The PCR conditions were 98°C for 2 min, followed by 12 cycles of 98°C for 15 s, 55°C for 30 s, and 72°C for 30 s. The PCR product was purified by AmpureXP beads and amplified for another 12 cycles using the conditions described for the first PCR, except that a primer that added a T7 promoter was used. The final PCR product was purified, quantified by Qubit fluorometric quantification (Thermo Fisher Scientific, Waltham, MA), and used as a template for in vitro transcription.
The whole-genome RNA baits generated from 1 μg of the PCR product are sufficient for 25 to 30 hybridization reactions (∼3 μg bait/reaction). These baits can be stored in small aliquots at −80°C and are stable for at least 1 year.

FIG 1.

Schematic representation of the whole-genome RNA baiting of stool specimens. The RNA bait library was generated from genomic DNA of E. coli O111:H8 BAA-181 strain. The DNA was fragmented, end repaired, A-tailed, and ligated to adaptors. The T7 promoter was added to the amplified PCR product and in vitro transcribed using biotin-labeled UTP. The RNA was then hybridized to the denatured stool library, isolated with streptavidin beads, and sequenced using an Illumina MiSeq instrument.

In vitro transcription.

The final PCR product was used as a template for in vitro transcription using a MEGAshortscript kit (Ambion, Carlsbad, CA). A 40-μl reaction was set up as follows: ∼1-μg PCR product, 4-μl reaction buffer (10×), 4-μl ATP solution (75 mM), 4 μl of CTP solution (75 mM), 4 μl of GTP solution (75 mM), 2 μl of UTP solution (75 mM), 2-μl Bio-16-UTP solution (10 mM), 4 μl of T7 enzyme mix, and RNase free water. The reaction mixture was incubated at 37°C for 3 h on a thermomixer at 1,000 rpm. The reaction was DNase treated with 3-μl (2 U/μl) TURBO DNase at 37°C for 30 min. The RNA was purified using acid phenol-chloroform followed by ethanol precipitation. RNA was resuspended in RNase-free water, quantified, and stored in small aliquots at −80°C.
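The reaction composition above implies final nucleotide concentrations that can be checked with simple arithmetic. The following is a minimal sketch, not part of the published protocol: the volumes and stock concentrations are taken from the text, while the function names and the assumption that water simply fills the remainder of the 40-μl reaction are illustrative.

```python
REACTION_VOLUME_UL = 40.0

# (volume in ul, stock concentration in mM), as listed in the text
NTP_COMPONENTS = {
    "ATP": (4.0, 75.0),
    "CTP": (4.0, 75.0),
    "GTP": (4.0, 75.0),
    "UTP": (2.0, 75.0),
    "Bio-16-UTP": (2.0, 10.0),
}
# Other fixed-volume components (ul)
OTHER_VOLUMES_UL = {"10x reaction buffer": 4.0, "T7 enzyme mix": 4.0}


def final_concentrations_mm():
    """Final concentration (mM) of each nucleotide in the 40-ul reaction."""
    return {
        name: vol * stock / REACTION_VOLUME_UL
        for name, (vol, stock) in NTP_COMPONENTS.items()
    }


def water_volume_ul(template_volume_ul):
    """RNase-free water needed to reach 40 ul (assumes water fills the remainder)."""
    used = (
        sum(vol for vol, _ in NTP_COMPONENTS.values())
        + sum(OTHER_VOLUMES_UL.values())
        + template_volume_ul
    )
    if used > REACTION_VOLUME_UL:
        raise ValueError("template volume exceeds the space left in the reaction")
    return REACTION_VOLUME_UL - used


def biotin_utp_fraction():
    """Share of the total UTP pool that is biotin labeled."""
    conc = final_concentrations_mm()
    return conc["Bio-16-UTP"] / (conc["Bio-16-UTP"] + conc["UTP"])
```

Under these assumptions, ATP, CTP, and GTP end up at 7.5 mM each, unlabeled UTP at 3.75 mM, and Bio-16-UTP at 0.5 mM, so roughly one in nine UTP molecules available to the polymerase carries biotin.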

NGS library prep.

The NGS library from pure STEC stool isolates was prepared using the Nextera XT kit (Illumina) per the manufacturer’s protocol. For capture by baiting directly from stool, about 100 to 500 ng of stool DNA was used to generate an NGS library. The DNA was sonicated to 300 bp, end repaired, and A-tailed as above, except that the NEXTflex DNA-seq barcode adaptor (Bioo Scientific, Austin, TX) was ligated to the DNA. The ligation reaction was purified by AmpureXP beads and amplified using the Kapa HiFi HotStart ReadyMix (Kapa Biosystems) and NEXTflex primer mix. The PCR conditions were 98°C for 2 min, followed by 20 cycles of 98°C for 15 s, 60°C for 30 s, and 72°C for 30 s. This NGS library was size selected by AmpureXP beads and quantified using a Qubit fluorometer. For stool DNA with a <100-ng concentration, the Illumina library was prepared using the Nextera XT kit per the manufacturer’s protocol, with the number of PCR cycles modified to 20. The total amount of NGS library obtained can vary from 50 ng to 2,000 ng. A small aliquot of each library was sequenced on the Illumina MiSeq platform as a prebaited control. The rest of each stool library was hybridized to the biotin-labeled RNA for enrichment.

Hybridization.

The NGS stool library was resuspended at a volume of 50 μl with dH2O and denatured at 95°C for 5 min, followed by a hold at 65°C on a thermocycler. In parallel ∼3 μg of labeled RNA bait was added to 50 μl of Rapid-hyb buffer (GE Healthcare, Pittsburgh, PA) and heated to 65°C. Both samples were mixed and hybridized at 65°C overnight. The next day, a 50-μl (0.5 mg) aliquot of Dynabeads M-280 streptavidin (Invitrogen, Green Island, NY) was washed two times with the 1× binding/wash buffer and resuspended in 2× binding/wash buffer (10 mM Tris HCl [pH 7.5], 1 mM EDTA, and 2 M NaCl). The hybridization reaction was then added to the resuspended beads followed by incubation for 3 h at room temperature while mixing at 1,350 rpm. The beads were washed 3 times with 1× binding/wash buffer at room temperature, 3 times with 1× SSC (1× SSC is 0.15 M NaCl plus 0.015 M sodium citrate) and 0.1% SDS at 65°C, and 3 times with 0.1× SSC and 0.1% SDS at 65°C for 10 min each on the thermomixer. The DNA from beads was eluted by adding 100 μl of elution buffer (1× Tris-EDTA [TE] and 1% SDS) and boiling the reaction at 95°C for 10 min. The eluted DNA was purified by AmpureXP beads and quantified using a Qubit Fluorometer.

Enrichment PCR.

The library was amplified for an additional 8 to 15 cycles based on the DNA concentration postcapture. Kapa HiFi HotStart ReadyMix (Kapa Biosystems) and the respective primers were used during library amplification under the following PCR conditions: 98°C for 2 min, 98°C for 15 s, 60°C for 30 s, and 72°C for 30 s. The PCR product was purified by AmpureXP beads and quantified using a Qubit fluorometer.

Sequencing and bioinformatics.

Each stool sample NGS library from preenrichment and postenrichment samples and the isolated strain was sequenced using Illumina MiSeq 2 × 250-cycle paired-end sequencing. The paired-end reads from the samples were classified by Kraken version 0.10.5-beta (13). The prebuilt Minikraken 20171019_8GB database was used for the reference sequence database. The Kraken output served as input in Krona tools (14) to generate the taxonomic assessment of the stool samples.

The E. coli O26:H11 str. 11368 genome (GenBank accession number AP010953) served as the reference sequence for bioinformatics analysis. This reference genome was selected because it is a closely related high-quality genome assembly and previous analyses in our laboratory using this reference have characterized NGS data from STEC isolates with great accuracy (unpublished data). The RNA baiting method efficiency was determined by assessing the percentage of the raw reads mapped over the E. coli reference genome for each sample before and after baiting. The reads were then filtered to discard PhiX contaminants using BBduk (http://jgi.doe.gov/data-and-tools/bb-tools/) and trimmed for Nextera adaptor sequences using Trimmomatic (15). The taxonomic affiliation of each of the paired reads was determined with Kraken using a local database consisting of a collection of genomes from 62 commonly known human gut bacterial genera (up to 100 randomly selected genomes per genus) (see Table S1 in the supplemental material). Reads not identified by Kraken as Escherichia or Shigella members were discarded. The retained reads were mapped over the reference genome using BWA-MEM (16). The human content in samples was determined using BBmap (http://jgi.doe.gov/data-and-tools/bb-tools/).
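The read-filtering step described above can be illustrated with a short Python sketch. It is not the authors' pipeline. It assumes the standard Kraken 1 per-read output (tab-delimited: classification flag `C`/`U`, read ID, NCBI taxonomy ID, read length, LCA mapping) and uses a small, illustrative set of taxonomy IDs; a real run would need every taxon under the genera Escherichia and Shigella. The function name is hypothetical.

```python
# Illustrative subset of NCBI taxonomy IDs (an assumption for this sketch):
# 561 = Escherichia (genus), 562 = Escherichia coli,
# 620 = Shigella (genus),    623 = Shigella flexneri
KEEP_TAXIDS = {561, 562, 620, 623}


def reads_to_keep(kraken_lines, keep_taxids=KEEP_TAXIDS):
    """Return IDs of classified reads whose assigned taxon is in keep_taxids."""
    keep = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and int(taxid) in keep_taxids:
            keep.add(read_id)
    return keep
```

The returned read IDs could then be used to subset the paired FASTQ files before mapping; that step is omitted here because it depends on the tooling at hand.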

To remove genomic regions with ambiguous allele frequencies, a dynamic masking step was performed for each sample by compiling allele frequencies (ignoring positions with no variation) and genomic positions, as determined using SAMtools Mpileup (17), with a minimum quality of Q20 for mapping and base calling. Genomic regions were masked to remove noise/background from the data when allele frequencies deviated from the overall average allele frequency by more than two standard deviations and five or more positions with such anomalous frequencies were present within a 1,000-bp window. To assess the E. coli population mixture in the sample, the remaining allele frequencies were used to build an allele frequency distribution and perform a multi-Gaussian curve fitting in R using the mclust package (Fig. 4) (18). The optimal number of components (n) of the fitted functions, i.e., the optimal estimated number of discrete E. coli populations, was determined by performing likelihood ratio tests between different numbers of components. The averages and standard deviations of the fitted components corresponding to the allele frequencies of each population were used to build individual consensus sequences. Any genomic positions outside the allele frequency standard deviation, as well as any position with a read depth lower than 40× and quality score (QUAL) of 100, as determined by BCFtools, were assigned an unknown state (N) in the final consensus sequences. A phylogenetic tree was generated with FastTree (19) under a general time reversible (GTR) substitution model and 20 gamma categories from a consensus sequence SNP alignment that included known STEC strains.
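The dynamic masking rule can be sketched in a few lines of Python. This is an illustration of the stated criteria only, not the authors' code: the text does not say how the 1,000-bp windows are anchored, so the sketch assumes a window is flagged whenever five consecutive anomalous positions span less than 1,000 bp, and it uses the population standard deviation. The function name and input layout are hypothetical.

```python
from statistics import mean, pstdev


def find_masked_regions(sites, n_sd=2.0, window=1000, min_hits=5):
    """Return merged (start, end) regions to mask.

    sites: iterable of (genomic_position, allele_frequency) for variable
    positions only (positions with no variation are excluded upstream).
    """
    sites = list(sites)
    freqs = [af for _, af in sites]
    mu, sd = mean(freqs), pstdev(freqs)

    # Positions whose allele frequency lies more than n_sd SDs from the mean.
    anomalous = sorted(pos for pos, af in sites if abs(af - mu) > n_sd * sd)

    regions = []
    for i in range(len(anomalous) - min_hits + 1):
        start, end = anomalous[i], anomalous[i + min_hits - 1]
        if end - start < window:  # min_hits anomalous sites inside one window
            if regions and start <= regions[-1][1]:
                # Overlaps the previous region: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions
```

With this definition an isolated anomalous position is left unmasked, whereas a dense cluster of them (for example, over a repeat or transposon) is removed as a block.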

FIG 2.

Whole-genome RNA baiting efficiency as determined by percentage of mapped reads. Unfiltered raw reads were mapped on the E. coli O26:H11 str. 11368 genome using BWA-MEM and percent mapped reads extracted with SAMtools. Baited samples show a consistent increase of percent mapped reads compared with the unbaited sample. The baiting efficiencies follow a logarithmic function showing that baits are more efficient when a lower concentration of E. coli DNA is present in the samples.

RESULTS

Determine STEC versus E. coli quantities in stool.

The cycle threshold (CT) values of stx1/stx2 (stx) and fumC targets in the assay were compared to estimate the relative abundance of STEC versus total E. coli load in the samples (Table 1). We determined that an stx − fumC difference of <3 CT indicates a sufficient amount of STEC cells in the sample and a higher SNP calling accuracy. Any difference higher than 3 indicates that the STEC load relative to commensal E. coli is not high enough to warrant NGS. However, as stx phages can exist as free phages in stool samples, this quantitative analysis may not be precise and only provides an indication of potential success.

TABLE 1.

STEC population estimation and SNP comparison results

| Sample | stx (CT) | fumC (CT) | Difference, stx − fumC (CT) | Estimated STEC population (%) | No. of SNPs vs isolate | Assessed genome^a after masking/filtering (%) |
|---|---|---|---|---|---|---|
| Stool-1 | 22.23 | 21.49 | 0.74 | >95 | 0 | 1.44 |
| Stool-2 | 23.42 | 21.32 | 2.1 | 84 | 0 | 40.56 |
| Stool-3 | 29.33 | 24.28 | 5.05 | 9 | 1,298 | 43.10 |
| Stool-4 | 23.46 | 21.97 | 1.49 | >95 | 1 | 42.87 |
| Stool-5 | 32.98 | 20.77 | 12.21 | 6 | 19,061 | 48.22 |
| Stool-6 | 23.5 | 23.16 | 0.34 | >95 | 0 | 59.89 |
| Stool-7 | 25.25 | 20.38 | 4.87 | 8 | 896 | 57.86 |
| Stool-8 | 22.15 | 20.64 | 1.51 | 86 | 0 | 61.76 |
| Stool-9 | 22.95 | 20.93 | 2.02 | >95 | 0 | 68.31 |
| Stool-10 | 26.7 | 24.86 | 1.84 | >95 | 0 | 68.17 |

^a Out of 5,697,340 nucleotides.
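The CT comparison in Table 1 amounts to a subtraction and a cutoff. The sketch below recomputes it from the tabulated stx and fumC values; the cutoff of 3 comes from the text, while the function and variable names are illustrative.

```python
# (stx CT, fumC CT) copied from Table 1
CT_VALUES = {
    "Stool-1": (22.23, 21.49),
    "Stool-2": (23.42, 21.32),
    "Stool-3": (29.33, 24.28),
    "Stool-4": (23.46, 21.97),
    "Stool-5": (32.98, 20.77),
    "Stool-6": (23.5, 23.16),
    "Stool-7": (25.25, 20.38),
    "Stool-8": (22.15, 20.64),
    "Stool-9": (22.95, 20.93),
    "Stool-10": (26.7, 24.86),
}
DELTA_CT_CUTOFF = 3.0  # threshold stated in the text


def delta_ct(stx_ct, fumc_ct):
    """stx - fumC CT difference, rounded to two decimals."""
    return round(stx_ct - fumc_ct, 2)


def suitable_for_direct_sequencing(stx_ct, fumc_ct, cutoff=DELTA_CT_CUTOFF):
    """True when the CT difference is below the cutoff."""
    return delta_ct(stx_ct, fumc_ct) < cutoff


passing = [
    name
    for name, (stx, fumc) in CT_VALUES.items()
    if suitable_for_direct_sequencing(stx, fumc)
]
```

Applied to the tabulated values, the cutoff retains seven of the ten stools and excludes stool-3, stool-5, and stool-7, which are the three samples with an estimated STEC population of 9% or less.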

STEC enrichment by whole-genome RNA baiting.

The enrichment efficiency of the RNA baiting protocol was determined by comparing the percentage of raw reads mapped over the E. coli O26:H11 reference genome before and after baiting of DNA libraries (Table 2, Fig. 2). The percentage of reads mapping to the E. coli reference recovered from the unbaited samples varied from 2.57% to 74.78% (26.3% on average), resulting in many cases in insufficient depth of coverage to perform bioinformatics analysis. Our analysis shows that the RNA baiting protocol consistently increased the overall E. coli DNA content in the stool specimens (Table 2). The baiting efficiency follows a logarithmic model that shows higher enrichment efficiencies when a lower proportion of E. coli DNA was present in the specimens and starts to plateau when around 40% or more of the total DNA in the extract already belongs to E. coli. The greatest enrichment was noted in stool-1, with 2.57% and 22.91% of reads mapping to E. coli before and after baiting, respectively, an approximately 9-fold enrichment (Table 2). The resulting average genome depth of coverage after baiting ranged from 35× to 76×, with genome coverages of 78% or higher before filtering for low-coverage regions or locations with ambiguous allele frequencies. These metrics were high enough to conduct all the necessary downstream bioinformatics analyses. Although the baits were generated using the E. coli O111:H8 STEC strain (see Materials and Methods), we were able to enrich for O45, O111, O103, O121, and O26 serotypes of STEC. The differential genomic content between the STEC strains and the reference strain used to map the reads or generate the baits may also contribute to lowering the overall genome coverage recovered for the analyses.

TABLE 2.

Enrichment efficiency of the RNA baiting protocol

| Sample | Serotype | Mapped reads, unbaited (%) | Mapped reads, baited (%) | Depth of coverage, unbaited (×) | Depth of coverage, baited (×) | Reads mapped to human, unbaited (%) | Reads mapped to human, baited (%) |
|---|---|---|---|---|---|---|---|
| Stool-1 | O26 | 2.57 | 22.91 | 0.5 | 35.48 | 27.25 | 18.85 |
| Stool-2 | O111 | 9.4 | 47.68 | 6.05 | 47.53 | 0.00 | 0.00 |
| Stool-3 | O45 | 9.59 | 46.35 | 3.15 | 44.06 | 0.03 | 0.01 |
| Stool-4 | O103 | 13.93 | 50.46 | 4.35 | 42.78 | 0.15 | 0.05 |
| Stool-5 | O103 | 18.65 | 53.79 | 17.08 | 45.8 | 0.00 | 0.00 |
| Stool-6 | O111 | 20.79 | 52.27 | 5.64 | 57.01 | 0.01 | 0.00 |
| Stool-7 | O103 | 26.2 | 66.87 | 7.53 | 66 | 0.03 | 0.00 |
| Stool-8 | O26 | 38.95 | 69.27 | 29.21 | 62.7 | 0.00 | 0.00 |
| Stool-9 | O103 | 53.54 | 77.94 | 16.05 | 75.88 | 0.42 | 0.13 |
| Stool-10 | O121 | 74.78 | 87.65 | 13.43 | 76.05 | 5.42 | 1.66 |
| Stool-11^a | NA | 21.69 | 62.46 | 8.59 | 61.63 | 0.56 | 0.16 |

^a Non-STEC sample.
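Fold enrichment per sample follows directly from the mapped-read percentages in Table 2. The sketch below computes it as the ratio of baited to unbaited percentages, which is one simple definition and an assumption here; the names are illustrative.

```python
# (unbaited %, baited %) of reads mapping to the E. coli reference, from Table 2
MAPPED_READS_PCT = {
    "Stool-1": (2.57, 22.91),
    "Stool-2": (9.4, 47.68),
    "Stool-3": (9.59, 46.35),
    "Stool-4": (13.93, 50.46),
    "Stool-5": (18.65, 53.79),
    "Stool-6": (20.79, 52.27),
    "Stool-7": (26.2, 66.87),
    "Stool-8": (38.95, 69.27),
    "Stool-9": (53.54, 77.94),
    "Stool-10": (74.78, 87.65),
    "Stool-11": (21.69, 62.46),  # non-STEC sample
}


def fold_enrichment(unbaited_pct, baited_pct):
    """Ratio of mapped-read percentage after baiting to that before baiting."""
    return baited_pct / unbaited_pct


folds = {name: fold_enrichment(*pair) for name, pair in MAPPED_READS_PCT.items()}
most_enriched = max(folds, key=folds.get)
least_enriched = min(folds, key=folds.get)
```

The ratios fall between about 1.17 (stool-10, which started at 74.78% E. coli reads) and about 8.9 (stool-1, which started at 2.57%), in line with the diminishing returns described for samples that are already rich in E. coli DNA.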

The taxonomic content of all samples was assessed with Kraken (bacterial and viral loads) and BBMap (human loads). A representation of the Kraken output using the minikraken database for the stool-3 sample preenrichment and postenrichment is shown in Fig. 3A. The E. coli proportion improved from 16% to 55% after enrichment, and the proportion of other organisms in this sample was reduced, e.g., Bacteroides ovatus from 16% to 7% postenrichment, indicating the RNA baits preferentially bound to regions of homology in the E. coli genome. In addition to E. coli, most of the samples analyzed contained a mixture of Bacteroides, Klebsiella, Citrobacter, and Faecalibacterium members, among others (Fig. 3A and B). Only stool-1 and stool-10 showed substantial amounts of human reads (Table 2), consistent with another published report (20). After removal of all reads not identified as Escherichia/Shigella spp., we performed a dynamic genome masking step in order to remove any portion of the mapped genome that showed an unusual distribution of allele frequencies. It is necessary to perform the dynamic masking individually for each sample because the genomic regions requiring masking vary between samples (these regions included transposons and other repetitive regions that show poor alignment statistics). This step also facilitated the generation of highly accurate allele frequency distributions for the estimation of the number of E. coli populations in each sample.

FIG 3.

(A) Taxonomic assessment of the stool samples. Krona plot showing Kraken classification of stool sample 3 DNA, preenriched and postenriched using whole-genome RNA baits. Kraken was run using the prebuilt minikraken_20171019_8GB database. The unclassified and human reads were ignored and only the classified reads were plotted. (B) Taxonomic content assessment of the samples prebaiting and postbaiting. The taxonomic affiliation of each of the paired reads was determined with Kraken using a local database consisting of a collection of genomes from 62 commonly known human gut bacterial genera (see Table S3 in the supplemental material). The frequency of reads matching human in samples was determined using BBmap. The “Other” category includes all the remaining reads not matching any of the major genera and the “Unclassified” category represents reads with undetermined taxonomic affiliation.

An allele frequency distribution from all genomic positions with allele frequencies below 1 was compiled and is shown in Fig. 4A. Multi-Gaussian functions were fitted over the allele frequency distributions in R to evaluate the population mixture of E. coli in the samples (Fig. 4B). All samples analyzed in this study were found to contain only one or two E. coli populations based on the multi-Gaussian fitting functions. Any sample found to contain more than two E. coli populations would have been discarded due to the difficulty associated with deconvoluting three or more alleles in a sample. In 7 out of 10 samples, the STEC population was high enough compared with the commensal E. coli population for successful whole-genome analysis, as predicted by an stx − fumC CT difference of <3 and by the allele frequencies (Table 1).
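The idea of fitting Gaussian components to an allele frequency distribution can be illustrated without R. The sketch below is a stand-in, not the authors' analysis: it uses a small hand-written expectation-maximization (EM) fit instead of mclust, and it selects the number of components with the Bayesian information criterion (BIC) rather than the likelihood ratio tests described in the text. All function names, the initialization scheme, and the floor on the standard deviation are assumptions.

```python
import math


def _pdf(x, mu, sd):
    """Normal density at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture(data, k, n_iter=200, min_sd=1e-3):
    """EM fit of a k-component 1-D Gaussian mixture to allele frequencies."""
    n = len(data)
    srt = sorted(data)
    grand_mean = sum(srt) / n
    spread = max(min_sd, math.sqrt(sum((x - grand_mean) ** 2 for x in srt) / n))
    # Start from evenly spaced quantiles, a shared spread, and equal weights.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    sds = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            p = [w * _pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: update weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sds[j] = max(min_sd, math.sqrt(var))
    loglik = sum(
        math.log(sum(w * _pdf(x, m, s) for w, m, s in zip(weights, means, sds)) or 1e-300)
        for x in data
    )
    return {"weights": weights, "means": means, "sds": sds, "loglik": loglik}


def bic(fit, k, n):
    """BIC with 3k - 1 free parameters (k means, k SDs, k - 1 weights)."""
    return (3 * k - 1) * math.log(n) - 2.0 * fit["loglik"]


def estimate_populations(data, max_k=2):
    """Return (best number of components, its fit) by lowest BIC."""
    fits = {k: fit_mixture(data, k) for k in range(1, max_k + 1)}
    best_k = min(fits, key=lambda k: bic(fits[k], k, len(data)))
    return best_k, fits[best_k]
```

For a sample in which a majority population and a minority population differ at many sites, the frequencies would be expected to cluster around two values (for instance near 0.85 and 0.15), and a two-component fit should be preferred over a single broad Gaussian. Mixture fits on real, noisy data can be sensitive to initialization, so the result of such a sketch should be checked against the plotted distribution.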

FIG 4.

Allele frequency estimation and consensus sequence building. (A) All allele frequencies of <1 over the mapped genomes were compiled and used to build an allele frequency distribution. (B) Multi-Gaussian curve fitting was performed in R to assess the E. coli population mixture in the sample. The optimal number of components (n) of the fitted functions was determined by performing likelihood ratio tests between the different component levels. Any samples with more than three E. coli populations (n > 3) or with problematic distributions were discarded from the analysis.

The consensus sequences from the seven stool samples in which STEC represented the majority of the E. coli population showed no differences compared with the corresponding isolate sequences (except for stool-4, which had 1 SNP difference). As expected from the SNP results, the seven samples clustered correctly with their expected isolate sequence when placed on a phylogenetic tree (Fig. 5). The three samples in which the estimated STEC population was 9% or lower (stool samples 3, 5, and 7) showed numerous SNP differences with sequences from their pure counterparts, most of which were heterozygous sites misidentified as homozygous (Table 1). Regardless, stool samples 3 and 7 still fell within the same clades as their respective isolate sequences with strong bootstrap support. However, stool sample 5 failed to cluster with its pure-culture counterpart, with 19,061 SNPs between them.
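A SNP count between a stool-derived consensus and its isolate sequence can be sketched as a position-by-position comparison that skips masked sites. This is a minimal illustration under the assumption that both sequences are already aligned to the same reference coordinates and that unknown or masked positions are written as N; the function name is hypothetical and the authors' actual comparison may differ.

```python
VALID_BASES = set("ACGT")


def count_snps(consensus, isolate):
    """Return (number of differing sites, number of sites compared).

    Sites where either sequence has a non-ACGT character (e.g., N) are skipped.
    """
    if len(consensus) != len(isolate):
        raise ValueError("sequences must be aligned to the same length")
    snps = 0
    compared = 0
    for a, b in zip(consensus.upper(), isolate.upper()):
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            if a != b:
                snps += 1
    return snps, compared
```

Reporting the number of compared sites alongside the SNP count matters here because only part of the genome survives masking and filtering, so a count of zero over a small assessed fraction carries less weight than zero over a large one.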

FIG 5.

Phylogenetic tree of the STEC stool samples and matched STEC isolates. Phylogenetic tree depicting the relationship between the STEC samples sequenced directly from stool samples and their pure sample isolate sequences. The numbers in parentheses correspond to the estimated fraction of STEC versus commensal E. coli in the samples. When STEC composed the majority population of a sample, we found no SNP differences with their isolate sequences, except for stool sample 4 which has 1 SNP difference with isolate sample 4.

The final genome coverage after filtering ranged from 1% to 68% (Table 1), which correlates with the depth of coverage of 35× to 76× (Table 2), respectively. Most of the pure isolates were sequenced at 50× or higher sequencing depth (data not shown).

DISCUSSION

WGS is transforming clinical laboratory practice due to its ability to provide a diagnosis in just a few days, which can significantly reduce associated costs. However, its application to primary clinical specimens, such as stool, is challenging due to the small amount of pathogen DNA relative to other bacterial DNA in the sample. To circumvent this obstacle, STEC whole-genome RNA baits were generated to pull down E. coli/STEC DNA from the stool library prior to WGS. The enrichment method has been consistently reproducible (data not shown). The overall baiting enrichment efficiencies varied from ∼1.17 times to 8.9 times more E. coli sequencing reads present in the postbaiting samples. The higher the E. coli content was in the prebaiting samples, the less efficient the enrichment was, probably a consequence of the already high load of E. coli saturating the available baits in the sample. As previously shown, a significant number of human reads were detected in only two samples, showing that the human DNA content at the end of the gastrointestinal tract can be low compared with the commensal bacterial DNA. While only a limited fraction of the total genome content can be assessed with high confidence using this baiting method, the recovered genome fraction is enough to determine, with good accuracy, whether two strains can be related in an ongoing outbreak investigation. Based on our data, we think that an STEC population representing at least 80% of the total population in a sample and a genome coverage at or above 40% are sufficient for correct sample classification.

Surprisingly, of the 10 STEC-positive samples tested, 7 contained a majority of STEC reads compared with commensal E. coli. It is possible that in most cases of food poisoning from STEC, the strain has the ability to quickly multiply and colonize the GI tract at the expense of the commensal population. When STEC reads are in the majority, it contributes to an accurate bioinformatic analysis in most cases. However, when the STEC was only a minor fraction of the total E. coli population in the samples, the bioinformatic analysis often failed to correctly identify heterozygous genomic positions corresponding to the minor allele population.

For this reason, we believe that an stx − fumC CT difference of <3 should be used to determine whether sequencing from a primary specimen will be successful or whether traditional culturing methods should be explored instead. In cases of patients with polyclonal STEC infection, unless one of the STEC populations is clearly dominant over the other STEC strain, the pipeline will most likely fail to correctly determine which alleles belong to which strain.

In this study, we demonstrated that RNA baiting enrichment and bioinformatic analysis can reliably be applied to clinical specimens to directly genotype STEC and accurately identify clusters of disease or outbreaks when no STEC isolate is available for testing. While the goal of this project was not to publicly release a finished pipeline, we hope that the basis of our findings and approaches will open the door for improvement and future development of similar types of analyses. Additionally, as with other whole-genome approaches, there is also the possibility to further characterize the STEC in the sample to assess serotype, toxin, and virulence-associated genes. Approaches to extract whole-genome sequence data from primary specimens will continue to improve and provide rapid characterization of pathogens for patient management and enhance outbreak investigation to impact public health.

Supplementary Material

Supplemental file 1
JCM.00307-19-s0001.pdf (15.4KB, pdf)

ACKNOWLEDGMENTS

We acknowledge William Wolfgang for a critical review of the manuscript and for oversight of culture isolate WGS data utilized for comparison in this study. We also thank the Wadsworth Center Applied Genomic Technologies Core for performing WGS.

This work was supported by Cooperative Agreement number NU50CK000423, funded by the Centers for Disease Control and Prevention. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention or the Department of Health and Human Services.

Footnotes

Supplemental material for this article may be found at https://doi.org/10.1128/JCM.00307-19.

REFERENCES

  • 1. Mead PS, Slutsker L, Dietz V, McCaig LF, Bresee JS, Shapiro C, Griffin PM, Tauxe RV. 1999. Food-related illness and death in the United States. Emerg Infect Dis 5:607–625. doi: 10.3201/eid0505.990502.
  • 2. Centers for Disease Control and Prevention. 2009. Preliminary FoodNet data on the incidence of infection with pathogens transmitted commonly through food—10 states, 2008. MMWR Morb Mortal Wkly Rep 58:333–337.
  • 3. Omer MK, Alvarez-Ordonez A, Prieto M, Skjerve E, Asehun T, Alvseike OA. 2018. A systematic review of bacterial foodborne outbreaks related to red meat and meat products. Foodborne Pathog Dis 15:598–611. doi: 10.1089/fpd.2017.2393.
  • 4. Schmidt H, Scheef J, Huppertz HI, Frosch M, Karch H. 1999. Escherichia coli O157:H7 and O157:H(-) strains that do not produce Shiga toxin: phenotypic and genetic characterization of isolates associated with diarrhea and hemolytic-uremic syndrome. J Clin Microbiol 37:3491–3496.
  • 5. Hughes JM, Wilson ME, Johnson KE, Thorpe CM, Sears CL. 2006. The emerging clinical importance of non-O157 Shiga toxin-producing Escherichia coli. Clin Infect Dis 43:1587–1595. doi: 10.1086/509573.
  • 6. Gould LH, Bopp C, Strockbine N, Atkinson R, Baselski V, Body B, Carey R, Crandall C, Hurd S, Kaplan R, Neill M, Shea S, Somsel P, Tobin-D'Angelo M, Griffin PM, Gerner-Smidt P, Centers for Disease Control and Prevention. 2009. Recommendations for diagnosis of Shiga toxin–producing Escherichia coli infections by clinical laboratories. MMWR Recomm Rep 58:1–14.
  • 7. Piralla A, Lunghi G, Ardissino G, Girello A, Premoli M, Bava E, Arghittu M, Colombo MR, Cognetto A, Bono P, Campanini G, Marone P, Baldanti F. 2017. FilmArray GI panel performance for the diagnosis of acute gastroenteritis or hemorragic diarrhea. BMC Microbiol 17:111. doi: 10.1186/s12866-017-1018-2.
  • 8. Gray J, Coupland LJ. 2014. The increasing application of multiplex nucleic acid detection tests to the diagnosis of syndromic infections. Epidemiol Infect 142:1–11. doi: 10.1017/S0950268813002367.
  • 9. Khare R, Espy MJ, Cebelinski E, Boxrud D, Sloan LM, Cunningham SA, Pritt BS, Patel R, Binnicker MJ. 2014. Comparative evaluation of two commercial multiplex panels for detection of gastrointestinal pathogens by use of clinical stool specimens. J Clin Microbiol 52:3667–3673. doi: 10.1128/JCM.01637-14.
  • 10. Huang JY, Henao OL, Griffin PM, Vugia DJ, Cronquist AB, Hurd S, Tobin-D'Angelo M, Ryan P, Smith K, Lathrop S, Zansky S, Cieslak PR, Dunn J, Holt KG, Wolpert BJ, Patrick ME. 2016. Infection with pathogens transmitted commonly through food and the effect of increasing use of culture-independent diagnostic tests on surveillance–foodborne diseases active surveillance network, 10 U.S. sites, 2012–2015. MMWR Morb Mortal Wkly Rep 65:368–371. doi: 10.15585/mmwr.mm6514a2.
  • 11. Melnikov A, Galinsky K, Rogov P, Fennell T, Van Tyne D, Russ C, Daniels R, Barnes KG, Bochicchio J, Ndiaye D, Sene PD, Wirth DF, Nusbaum C, Volkman SK, Birren BW, Gnirke A, Neafsey DE. 2011. Hybrid selection for sequencing pathogen genomes from clinical samples. Genome Biol 12:R73. doi: 10.1186/gb-2011-12-8-r73.
  • 12. Mingle LA, Garcia DL, Root TP, Halse TA, Quinlan TM, Armstrong LR, Chiefari AK, Schoonmaker-Bopp DJ, Dumas NB, Limberger RJ, Musser KA. 2012. Enhanced identification and characterization of non-O157 Shiga toxin-producing Escherichia coli: a six-year study. Foodborne Pathog Dis 9:1028–1036. doi: 10.1089/fpd.2012.1202.
  • 13. Wood DE, Salzberg SL. 2014. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15:R46. doi: 10.1186/gb-2014-15-3-r46.
  • 14. Ondov BD, Bergman NH, Phillippy AM. 2011. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12:385. doi: 10.1186/1471-2105-12-385.
  • 15. Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120. doi: 10.1093/bioinformatics/btu170.
  • 16. Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]. https://arxiv.org/abs/1303.3997.
  • 17. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079. doi: 10.1093/bioinformatics/btp352.
  • 18. Scrucca L, Fop M, Murphy TB, Raftery AE. 2016. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8:205–233.
  • 19. Price MN, Dehal PS, Arkin AP. 2009. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 26:1641–1650. doi: 10.1093/molbev/msp077.
  • 20. Vincent C, Mehrotra S, Loo VG, Dewar K, Manges AR. 2015. Excretion of host DNA in feces is associated with risk of Clostridium difficile infection. J Immunol Res 2015:246203. doi: 10.1155/2015/246203.


