ABSTRACT
Diagnostic testing for foodborne pathogens relies on culture-based techniques that are not rapid enough for real-time disease surveillance and do not give a quantitative picture of pathogen abundance or the response of the natural microbiome. Powerful sequence-based culture-independent approaches, such as shotgun metagenomics, could sidestep these limitations and potentially reveal a pathogen-specific signature on the microbiome that would have implications not only for diagnostics but also for better understanding disease progression and pathogen ecology. However, metagenomics has not yet been validated for foodborne pathogen detection. Toward closing these gaps, we applied shotgun metagenomics to stool samples collected from two geographically isolated (Alabama and Colorado) foodborne outbreaks, where the etiologic agents were identified by culture-dependent methods as distinct strains of Salmonella enterica subsp. enterica serovar Heidelberg. Metagenomic investigations were consistent with the culture-based findings and revealed, in addition, the in situ abundance and level of intrapopulation diversity of the pathogen, the possibility of coinfections with Staphylococcus aureus, overgrowth of commensal Escherichia coli, and significant shifts in the gut microbiome during infection relative to reference healthy samples. Additionally, we designed our bioinformatics pipeline to deal with several challenges associated with the analysis of clinical samples, such as the high frequency of coeluting human DNA sequences and assessment of the virulence potential of pathogens. Comparisons of these results to those of other studies revealed that in several, but not all, cases of diarrheal outbreaks, the disease and healthy states of the gut microbial community might be distinguishable, opening new possibilities for diagnostics.
IMPORTANCE Diagnostic testing for enteric pathogens has relied for decades on culture-based techniques, but a total of 38.4 million cases of foodborne illness per year cannot be attributed to specific causes. This study describes new culture-independent metagenomic approaches and the associated bioinformatics pipeline to detect and type the causative agents of microbial disease with unprecedented accuracy, opening new possibilities for the future development of health technologies and diagnostics. Our tools and approaches should be applicable to other microbial diseases in addition to foodborne diarrhea.
KEYWORDS: Salmonella, diagnostics, diarrhea, human gut, metagenomics
INTRODUCTION
Diagnostic testing for enteric pathogens has relied for decades on culture-based techniques, and culture-derived isolates currently form the foundation for public health surveillance. However, the clinical landscape is rapidly changing as culture-independent diagnostic tests (CIDTs), such as antigen detection and PCR assays, are gaining ground in the clinical arena (1). CIDTs offer several advantages to the clinician, such as lower costs and more rapid test results, although sensitivity may be limited because CIDTs are typically based on information from a taxonomically narrow set of strains. Further, a substantial fraction of the gut microbial community, including opportunistic and rare pathogenic microbes, remains uncultivated and hence undetectable by current culture-based tests (2–4). Accordingly, a total of 38.4 million cases of foodborne illness per year cannot be attributed to specific causes, and the proportion caused by yet-to-be-described microbial agents is unknown (5). Sequence-based CIDTs, such as metagenomics, could provide the means for characterizing these unknown agents, including those that are uncultured for the purpose of diagnostics, and perhaps even subtyping at a level sufficient for outbreak investigations. In addition, sequence-based approaches are more reliable and robust than antigen- or PCR-based approaches due to the inherent technical limitations of the PCR-based techniques, such as amplification biases and limited primer specificity for strain typing (6), and they can detect cases of coinfection by multiple pathogens, which are often missed due to the traditional focus on single tests or isolate colonies. Therefore, the development and validation of sequence-based CIDTs hold great potential for facilitating epidemiological investigations of enteric infections and disease surveillance.
Enteric foodborne diarrhea is an ideal case to develop metagenome-based CIDTs because the agent frequently remains unidentified either due to technical challenges (e.g., uncultivated organisms) or the fact that several diarrheal infections quickly self-resolve. Furthermore, little is known about the effects of enteropathogens on the normal gut microbial community and whether invasive pathogens produce similar or distinct alterations in microbial composition and function due to their characteristic virulence factors. It is also not understood how these effects differ across a spectrum of diarrheal severity or relative to noninfectious (osmotic) diarrhea. Advancing these issues could lead to a better fundamental understanding of disease progression and provide critical information for diagnostics, e.g., pathogen-specific signatures of the perturbed gut microbiome. Thus far, only a very limited number of studies have characterized the alterations in the gut microbiome in response to diarrheal diseases, and these were not focused on diagnostics and validation of sequence-based CIDTs, nor were they limited to isolate-based typing (7, 8).
In outbreak detection, it is important to be able to cluster isolates from outbreaks of a common source, distinguish unrelated outbreaks, and exclude isolates of similar subtypes that are recovered in the same time window but are not part of the outbreak. Therefore, any shotgun metagenomics or amplicon sequencing approach used on DNA from disease-state stools in the CIDT era must also be able to yield the same level of resolution obtained from isolate-based sequencing. The added challenge for shotgun metagenomics is that it must yield these data against the background noise of a stool microbiome and coeluting human DNA sequences. Therefore, the development of bioinformatics pipelines, along with standards for translating the large amount of sequencing data into useful and reproducible information for routine diagnostics, typing, and surveillance, as well as assessment of the virulence factors of pathogens, is as critical as the advances that have been made in the molecular laboratory.
To provide new insights into the above-mentioned issues, determine the strengths and weaknesses of metagenomics-based CIDT, and establish the best approaches to subtype etiological agents in clinical specimens, we applied whole-genome shotgun (WGS) sequencing to specimens available from two distinct outbreaks and compared the results to the available epidemiological and isolate-based data. This is one of the first examples of application of metagenomics to pathogen detection in acute diarrheal cases, and thus, the results and lessons learned should have implications for the future development of sequence-based diagnostics and subtyping methods for pathogens in enteric diarrheal or other clinical infections.
RESULTS
Foodborne outbreak description and specific research objectives.
In July 2013, the Centers for Disease Control and Prevention (CDC), along with the Colorado Department of Public Health and Environment and the Alabama Department of Public Health, investigated two outbreaks attributed to Salmonella enterica subsp. enterica serovar Heidelberg. The Alabama outbreak affected almost a hundred people at a funeral luncheon, and the Colorado outbreak affected six people at a dinner. Rapid onset of symptoms following exposure was observed in both incidents (as little as 2 to 3 h), with at least a third of the infected people ending up in the hospital for treatment, and at least one fatal case was reported from each outbreak. Culture-based methods showed that patients from both incidents were infected with Salmonella Heidelberg, and Salmonella isolates from both incidents were indistinguishable by pulsed-field gel electrophoresis (PFGE). Because these two outbreaks occurred in the same month with rapid onset of similar symptoms and an identical PFGE pattern, a common source was suspected. However, whole-genome sequencing of these Salmonella isolates showed that the outbreaks were in fact caused by two closely related but distinct genotypes (Fig. 1) (H. Carleton and A. Sabol, personal communication). Thus, the CDC concluded that the two outbreaks were independent of each other. Because primary specimens (stools) were also obtained from both outbreaks, the culture-based results provided the opportunity to assess the efficacy of a culture-independent WGS metagenomics-based approach in addressing the needs of clinical diagnostics and outbreak investigations. Specifically, we asked the following questions: how prevalent is the pathogen signal in the background of a diverse stool microbial community, and can we disentangle the genome of the causative agent for diagnosis? More importantly, can we subtype to the appropriate level to distinguish the two outbreaks? 
What type and magnitude of community perturbation occur during acute gastroenteritis caused by S. enterica? Last, do we see any evidence of coinfection by other pathogens? The rapid onset of gastroenteritis suggested that other agents may have been present, such as Staphylococcus aureus, e.g., 3 to 6 h for onset of symptoms for S. aureus versus 12 to 24 h for Salmonella, typically (9, 10).
FIG 1.
Phylogenetic relatedness among Salmonella Heidelberg isolate and population bin genomes recovered from the two outbreaks. The phylogenetic tree of Salmonella Heidelberg isolates from outbreak-associated samples (within colored boxes), other unrelated isolates from the same year (remaining sequences, except GenBank accession no. NC_011083, which represents a reference Salmonella Heidelberg strain), and Salmonella population bins recovered from outbreak metagenomic data sets (in red, within boxes) based on the alignment of high-quality SNPs distributed across the genome are shown. The red box denotes a cluster of sequences associated with the Alabama outbreak; the blue box denotes the Colorado one (except for isolate 2013K-1036, which was a sporadic isolate from the same time period and of the same PFGE pattern but unrelated to the CO outbreak). The numbers at each node indicate the bootstrap support from 100 replicates.
Metagenomic yield from disease-state stools.
Metagenomic sequences were obtained for a total of 11 specimens, 6 from Alabama (AL) and 5 from Colorado (CO), with each sample from a different sick individual. The specimens differed in the proportional amount of microbial versus human sequences (no human DNA subtraction was attempted). Recovery of microbial DNA varied between samples, and some extractions contained large amounts of human host DNA, reflecting the effects of acute diarrheal disease on the intestinal epithelium (Table S1). Following the removal of reads containing human sequences, between 0.06 and 5.5 Gb (0.7 to 72%) of sequencing data remained for (microbial) metagenomic analyses (Fig. S1 shows our complete bioinformatics pipeline for all analyses performed). We used estimates of shotgun sequencing coverage to evaluate our metagenomic data sets for potential pathogen recovery using Nonpareil, an assembly- and reference-independent measure of overlap among sequencing reads (11). Although the total microbial sequencing recovery varied between samples, >95% of the microbial population was covered in 7 (63%) of the metagenomes, according to our estimates (Fig. S2), except patient stools 100_Kickstart, 197A_Reinvent, 168_Loopy, and 124_Usual. Previous observations have determined that data sets with >60% average coverage perform well for de novo assembly and data set comparison for the detection of differentially abundant functional genes (11), suggesting that our sequence effort was sufficient for robustly evaluating the microbial community for the majority of the specimens (7 out of 11, at least). Pathogen load remains poorly understood during foodborne infection, and if the etiologic agent is present at very low abundance, it may still escape detection by metagenomics, even in cases of high sequence coverage. 
However, previous recovery of a Shiga-toxigenic Escherichia coli O104:H4 genome from stool metagenomic data sets observed that the pathogen accounted for a “sizable proportion” of the microbial reads in many samples (7), while the limit of detection of our approach was estimated to be ∼0.001% of total DNA, based on spiking reads from a target reference genome in our metagenomes and requiring an average coverage of ∼0.1× across the genome. Collectively, our results suggest that the amount of sequencing per sample obtained here was adequate for robust epidemiological investigation of stool samples, despite the coelution of large amounts of human DNA, and provided a dynamic range in terms of relative abundance for detecting the causative agent of at least 5 orders of magnitude.
Salmonella detection.
Routine surveillance by the CDC using culture-based techniques, including isolate genome sequencing followed by phylogenetic analysis, identified Salmonella Heidelberg as the likely causative agent of diarrhea. By using a BLAST recruitment method, which measures the coverage depth across an isolate Salmonella genome recovered from the outbreak by those reads matching at a species-specific nucleotide identity (>95%) (12), we corroborated the presence of a Salmonella serovar Heidelberg strain in the metagenomes and noted sufficiently high coverage to support de novo assembly (7.8× to 120.4× coverage depth) (Table S2). To gauge recovery of Salmonella genomic material in an unbiased fashion, we performed metagenomic de novo assembly of each metagenomic data set, using both IDBA-UD and Omega (13, 14), and then extracted those contigs that matched with high identity (>95%) and coverage (>80% of query length) to a reference Salmonella genome. We found a large number of long contigs matching Salmonella, indicative of both the abundance and high quality of the Salmonella reads in these stool libraries (total assembled length, 1.5 to 4.2 Mbp, depending on the data set considered; Table S2).
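The contig-selection step described above (identity >95% and coverage >80% of query length) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: it assumes BLAST hits in the default 12-column tabular layout, a hypothetical `contig_lengths` dictionary supplied by the caller, and that the best-scoring hit of each contig alone decides whether the contig is kept.

```python
def filter_contigs(blast_lines, contig_lengths,
                   min_identity=95.0, min_query_cov=0.80):
    """Return sorted IDs of contigs whose best hit passes both thresholds.

    blast_lines    : iterable of tab-separated hits in the default BLAST
                     tabular column order (qseqid sseqid pident length
                     mismatch gapopen qstart qend sstart send evalue bitscore)
    contig_lengths : hypothetical dict mapping contig ID -> length in bp
    """
    best = {}  # contig ID -> (bitscore, percent identity, query coverage)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        contig_id = fields[0]
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        bitscore = float(fields[11])
        # Fraction of the contig spanned by this single alignment.
        coverage = (abs(q_end - q_start) + 1) / contig_lengths[contig_id]
        if contig_id not in best or bitscore > best[contig_id][0]:
            best[contig_id] = (bitscore, identity, coverage)
    return sorted(
        contig_id
        for contig_id, (_, identity, coverage) in best.items()
        if identity > min_identity and coverage > min_query_cov
    )


# Toy hits (made-up values): only contig_1 passes both thresholds.
example_hits = [
    "contig_1\tref\t99.2\t900\t7\t0\t1\t900\t5001\t5900\t0.0\t1600",
    "contig_2\tref\t90.0\t950\t95\t0\t1\t950\t100\t1049\t0.0\t1200",
    "contig_3\tref\t98.0\t500\t10\t0\t1\t500\t100\t599\t0.0\t880",
]
example_lengths = {"contig_1": 1000, "contig_2": 1000, "contig_3": 1000}
print(filter_contigs(example_hits, example_lengths))
```

Using only the best hit is a simplification; contigs aligned in several pieces would need their alignments merged before coverage is computed.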
We then asked if we could recover sufficient variant information from these Salmonella contigs to be able to distinguish the two outbreaks from one another, as well as from unrelated samples. Also we wanted to know if the metagenomic Salmonella contigs were representative of the isolates recovered from the outbreaks, as well as the diversity of the Salmonella population(s) present in the outbreak samples. Due to the anonymization of specimens during the outbreak investigation, we could not directly compare isolates and specimen results. Nonetheless, core-gene phylogeny of contigs and available isolate genome sequences revealed that both metagenomically and isolate-derived Salmonella sequences from the AL outbreak clustered together, with a median intracluster distance of 5 single nucleotide polymorphisms (SNPs) (range, 0 to 39 SNPs) (Fig. 1). Similarly, metagenomically and isolate-derived Salmonella sequences from the CO outbreak clustered together, with a median intracluster distance of 4 SNPs (range, 1 to 11 SNPs). The two outbreak clusters were clearly distinguished from one another, with a median SNP distance of 53 SNPs (range, 50 to 89 SNPs) (P < 0.01, G test). Finally, the outbreak clusters were also clearly distinguishable from unrelated Salmonella sample sequences obtained by the CDC during the same period, showing 54 to 68 SNP differences in a comparison of AL outbreak samples with unrelated samples and 20 to 50 SNP differences in a comparison of CO outbreak samples with unrelated samples.
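The within-cluster and between-cluster SNP distances reported above are medians over pairwise comparisons of aligned sequences. The sketch below shows that calculation on made-up toy sequences; the function names are hypothetical, the sequences are assumed to be pre-aligned and of equal length, and positions with gaps or ambiguous bases are simply ignored, which may differ from how the study handled them.

```python
from itertools import combinations, product
from statistics import median

VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in VALID_BASES and b in VALID_BASES
    )


def median_within(cluster):
    """Median pairwise SNP distance among sequences of one cluster."""
    return median(snp_distance(a, b) for a, b in combinations(cluster, 2))


def median_between(cluster_a, cluster_b):
    """Median SNP distance over all pairs drawn from two clusters."""
    return median(snp_distance(a, b) for a, b in product(cluster_a, cluster_b))


# Toy alignments (not real data): two small clusters of 10-bp sequences.
toy_al = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAC"]
toy_co = ["TCGAACGTAC", "TCGAACGTCC"]
print(median_within(toy_al), median_between(toy_al, toy_co))
```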
De novo assembly yields consensus sequences, which may not accurately represent potential strain-level diversity within microbial populations. To evaluate this issue, we performed population binning based on individual samples and subsequent quantification of individual strains within a metagenomic population based on (i) a recently developed algorithm for this purpose, ConStrains (15), and (ii) phylogenetic analysis of raw (unassembled) metagenomic reads mapping on the population bins. ConStrains analysis revealed that it was unlikely that AL samples contained multiple strains of Salmonella (Fig. 2). However, two Salmonella strains likely coexisted in two CO samples (189A-Dizzy2 and 136FXP-Acidic, represented by short branches in Fig. 1 and 2), albeit their core genomes did not diverge enough to affect the consensus sequences and hence, the genome tree. From the core-gene set analysis, we identified only 2 genes, UDP-phosphate galactose phosphotransferase (rfbP) and cysteine sulfinate desulfinase (csdA), that had no gaps in the alignment and more than 1 SNP each (2 SNPs). Phylogenetic reconstruction based on these two genes that included the homologs in the assembled isolates, reference genomes, population genomes (bins), and raw unassembled reads encoding fragments of the genes provided evidence that the two outbreaks were caused by distinct clades and there were no subpopulations or distinct strains within a clade; i.e., reads from data sets from each outbreak had different SNPs (star-like phylogeny; see Fig. S3).
FIG 2.
Gene content comparisons of reference genomes with isolate and de novo assembled population genome bins from Alabama (AL) and Colorado (CO) outbreaks. The phylogenetic trees of E. coli (a) and Salmonella (b) strains shown are based on alignments of concatenated core genes, consisting of 3,847 orthologs shared by every Salmonella genome used in this study. Black circles represent reference genomes, and red and yellow circles represent CO and AL isolates, respectively. Isolates are highlighted with black borders. The color shading groups genomes into commonly accepted clusters (UPEC, uropathogenic E. coli; APEC, avian-pathogenic E. coli; EHEC, enterohemorrhagic E. coli; commensal, isolates not causing disease in healthy humans and being part of the natural gut flora). The most related Salmonella genomes, i.e., isolates from the outbreaks, 2 Heidelberg reference genomes (SL476 and B182) and 2 de novo assembled population bins from AL and CO, were also clustered based on their gene content; only variable components of the pangenomes are shown for clarity (c). (c) Left, cladogram using Manhattan distance based on gene content presence/absence of the total number of variable orthologous genes (OGs) identified (n = 711); right, subset of the variable genes that made up the genome-specific gene clusters discussed in the text. Numbers in parentheses correspond to the total number of genes that belong to each cluster. (d) Annotation of AL-specific, CO-specific, and Heidelberg-specific OGs is shown by categories in bar plots.
We also analyzed gene content to identify any functions differentially present between isolate genomes from different outbreaks and the population bins recovered from the metagenomes. In total, 3,847 orthologs shared by every Salmonella (isolate or population bin) genome used in this study (core) and 711 variable orthologs, i.e., those absent in at least one genome, were identified, revealing a similar overall gene complement among all the Salmonella genomes (Fig. 2c), given also that several of the population bins were incomplete (Table S3). Clustering based on the presence/absence of the 711 variable genes also separated CO from AL Salmonella isolates and the population genome bins recovered from metagenomes, similar to a core-genome phylogenetic reconstruction. However, AL and CO population genomes (bins) clustered together, presumably due to the fact that they were missing a small set of Salmonella core genes due to incomplete genome recovery during binning, as opposed to real (higher) gene content relatedness.
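The legend of Fig. 2c specifies Manhattan distance on gene presence/absence, and the bin artifact described above follows directly from how that distance behaves. The toy sketch below uses invented seven-gene profiles and hypothetical names; it only illustrates how genes missing from incomplete bins can pull two bins together, and is not the clustering code used in the study.

```python
def manhattan(profile_a, profile_b):
    """Manhattan distance between two 0/1 gene presence/absence vectors."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genes")
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))


def nearest_neighbor(name, profiles):
    """Return the name of the genome closest to `name` in gene content."""
    others = (other for other in profiles if other != name)
    return min(others, key=lambda other: manhattan(profiles[name], profiles[other]))


# Invented profiles. Columns: 2 AL-specific genes, 1 CO-specific gene,
# then 4 genes that both incomplete bins failed to recover.
toy_profiles = {
    "AL_isolate": [1, 1, 0, 1, 1, 1, 1],
    "CO_isolate": [0, 0, 1, 1, 1, 1, 1],
    "AL_bin":     [1, 1, 0, 0, 0, 0, 0],
    "CO_bin":     [0, 0, 1, 0, 0, 0, 0],
}

# The AL bin is closer to the CO bin (distance 3) than to the AL isolate
# (distance 4), because the shared missing genes dominate the distance.
print(nearest_neighbor("AL_bin", toy_profiles))
```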
Six subclusters of orthologous genes were identified among the variable genes: cluster 1 was composed of 514 genes that were absent only in the genome bins. Because it was challenging to assess which of these genes were absent due to incomplete genome recovery as opposed to real gene differences, analysis of these genes was not pursued further. Cluster 2 was composed of 54 orthologous genes present only in the outbreak isolates, cluster 3 had 41 AL-specific orthologous genes, cluster 4 had 4 CO-specific genes, and clusters 5 and 6 had 14 population-specific and 7 Salmonella Heidelberg-specific genes, respectively (Fig. 2c). The majority (58% of total) of the AL-specific genes were related to phage assembly and insertion, but a few AL-specific genes (n = 12) were related to central metabolism, for example, glycoside hydrolase and oxidoreductases. No obvious pathogenicity factor was identified among these variable genes (Fig. 2d). Therefore, variable gene content, in addition to core-gene phylogeny, was consistent with the conclusion that the outbreaks in CO and AL were caused by highly related yet distinct genotypes of Salmonella.
Microbial community composition in disease-state stools.
To assess microbial community composition, 16S rRNA gene fragments were identified among the metagenomic reads, clustered, and taxonomically classified. A comparison of normalized relative abundance at the phylum level indicated that these microbial communities were dominated by Bacteroidetes and Firmicutes, consistent with previous descriptions of the human gut microbiota (Fig. S2). In general, Bacteroidetes were most abundant in the AL samples, while Firmicutes and Proteobacteria were most abundant in the CO samples. However, most samples also included a large proportion of Proteobacteria, indicative of the underlying foodborne Salmonella infections suffered by these patients. A nonmetric multidimensional scaling (NMDS) plot of normalized composition at the genus level indicated that microbial communities from the two outbreaks were distinct, and samples collected from the same outbreak clustered together, in general (Fig. S2 and S4). The outbreak samples were also clearly distinct from a set of 13 representative samples that captured the total diversity of the 300 healthy American samples determined previously by the HMP Consortium (16) based on k-medoid clustering of 16S rRNA gene fragments recovered in the metagenomes. Similar results were obtained with MetaPhlAn2 taxon relative abundance profiles at three different levels (phylum, genus, and species) using taxa with relative abundance of more than 0.1% in the samples; i.e., the outbreak samples were clearly separated from the representative healthy human microbiome samples (permutation test, P < 0.01; Fig. 3, left panel). We applied a Kruskal-Wallis test to both the MetaPhlAn2 and HUMAnN2 profiles of the three groups of samples, and it showed that the three groups were significantly different from each other, with P values of 1.4 × 10⁻⁶ for functions (HUMAnN2) and 1.23 × 10⁻⁶ for species (MetaPhlAn2).
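For three groups, the statistic behind a Kruskal-Wallis comparison such as the one above can be computed in a few lines. The sketch below is illustrative only: the input values are made up, the function names are hypothetical, and the P value relies on the chi-square approximation, which for three groups (2 degrees of freedom) reduces to exp(-H/2). How per-sample profiles were summarized before testing is not specified here, so the example simply takes one value per sample.

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    pooled = sorted((value, gi) for gi, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1  # pooled[i:j] is a run of tied values
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        rank_sum ** 2 / len(group) for rank_sum, group in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def p_value_three_groups(h):
    """Chi-square survival function for 2 degrees of freedom: exp(-h/2)."""
    return math.exp(-h / 2)


# Made-up values for three groups of samples (e.g., AL, CO, healthy).
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_stat = kruskal_wallis_h(toy_groups)
print(h_stat, p_value_three_groups(h_stat))
```

The chi-square approximation is rough for groups this small; an exact or permutation-based P value would be preferable in practice.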
FIG 3.
Taxonomic and functional comparison of metagenomes from the Alabama (AL) and Colorado (CO) outbreaks to healthy microbiome ones (HMP). Three hundred healthy HMP samples from North Americans were clustered as described in Materials and Methods and are represented by 13 representative samples/clusters (in red). The PCoA scatter plots show the taxonomic relatedness of the samples using their MetaPhlAn version 2.0 profiles at the genus level (left) and their functional relatedness based on HUMAnN2.0's pathway relative abundance profiles (right).
We further analyzed the functional separation of these samples using HUMAnN2.0's pathway profiles, and a pronounced sample separation was also observed (Fig. 3, right panel). Between AL and CO samples, there was separation, but not as pronounced as that between disease and healthy samples. In both functional analysis and taxonomic principal-coordinate analysis (PCoA), we observed a slight mixing of samples across outbreaks, especially with phylum-level taxonomic profiles. The mixing was mainly driven by the high abundance of E. coli in CO sample 124-Usual, which resembled AL outbreak samples more than the other CO samples. At a functional level, such mixing was less clear, indicating a function-driven assembly of microbes during the outbreaks. The primary factors separating diarrheal samples and healthy HMP samples were E. coli and Salmonella; E. coli was frequently as abundant as, if not more abundant than, Salmonella, depending on the sample considered, and such high abundances of E. coli and Salmonella in the same sample (see, e.g., Fig. S4) were never observed in any of the 300 healthy HMP data sets. Variance analysis revealed that E. coli and Salmonella contributed most toward the separation of outbreak samples from the healthy HMP samples. This led us to more closely examine E. coli as a possible coinfecting pathogen.
Potential coinfection.
Several lines of evidence indicated that the E. coli populations detected were likely not highly virulent but rather commensal populations, favored by the conditions during infection. First, the recovered E. coli population genome bins from CO samples were assigned to a B2 clade, overpopulated by commensal strains, while the AL ones represented different genotypes, scattered across the E. coli phylogenetic tree (Fig. 2a), and did not cluster with any of the known highly virulent clades, such as Shiga toxin-producing O157:H7 (Fig. S5). A single genotype would have been expected based on the epidemiological data (e.g., acute progression of the disease, linked to a specific event) if E. coli were an important coinfectant. Further, although a few known pathogenicity factors, mostly linked to adhesion and iron acquisition, such as fimbrial adhesins, were detected in the recovered E. coli population genomes (Fig. S6), none of the major pathogenicity factors, such as Shiga toxin or hemolysin, were detected in the population genome bins or the assembled (or unassembled) metagenomic sequences (Fig. S3). Although the fimbrial adhesins encoded by the type I fimbrial operon were initially considered to be important colonization factors responsible for intestinal and extraintestinal infections in humans, recent studies have shown that an incomplete version of the operon is commonly found in commensal E. coli strains derived from healthy humans (17). In addition, it seems unlikely that Shiga toxin genes would be undetected due to sequencing gaps or misassembly, given the high coverage of the E. coli population (>10× in some samples; E. coli population bin showing >90% completeness [Table S3]). In contrast, the Salmonella populations were phylogenetically closely related to the known pathogenic strains S. enterica Heidelberg and contained almost the complete gene content of the S. enterica Heidelberg strains, including the effector proteins, such as SspH2, that can subvert host immune responses (18).
Based on the rapid onset of symptoms, we also looked for signs of coinfection with Staphylococcus aureus. One sample (197A) had sufficient coverage to recover 81.4% of the S. aureus genome, and from a second sample (189A), we recovered 5.8% of the S. aureus genome (Fig. S7). Both samples were associated with the Colorado outbreak. The remaining samples showed either no coverage of S. aureus genes or covered nondiagnostic genes, such as mobile elements and hypothetical proteins. S. aureus reads from the 197A or 189A data set mapped to the reference strain NCTC 8325 genome with an average nucleotide identity of 99.81% and covered the full gene content of virulent S. aureus reference strains, including 7 major pathogenicity factors, suggesting that the S. aureus populations were highly clonal and presumably pathogenic. More specifically, the genes encoding the staphylococcal enterotoxin (sek2), the delta-hemolysin toxin (hld), and the protein SCIN, which plays a role in host immune evasion, were all detected, at a coverage similar to that of the rest of the genome (Table S4). Hence, these findings indicated that coinfection with Staphylococcus was likely for at least some patients.
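The fraction of a reference genome recovered from a sample, as reported above for S. aureus, is a breadth-of-coverage calculation. A minimal sketch follows, assuming read alignments are already available as 0-based, half-open (start, end) intervals on a single reference sequence; the function name and the toy numbers are hypothetical and do not come from the study.

```python
def breadth_of_coverage(intervals, genome_length):
    """Fraction of reference positions covered by at least one alignment.

    intervals     : iterable of (start, end) pairs, 0-based and half-open
    genome_length : length of the reference sequence in bp
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Toy example: three alignments on a 1,000-bp reference, two overlapping.
toy_alignments = [(0, 100), (50, 150), (300, 400)]
print(breadth_of_coverage(toy_alignments, 1000))
```

Merging overlapping intervals before summing avoids counting positions twice, which matters when read depth is uneven across the genome.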
The food associated with the Alabama outbreak did not test positive for preformed Staphylococcus enterotoxin, and attempts to isolate S. aureus from this food did not show growth. No food from the Colorado outbreak was recovered for testing. We also attempted to recover coagulase-positive S. aureus isolates from a superset of 24 stools from both outbreaks. Coagulase-positive S. aureus was recovered from six patients from Alabama and two patients from the Colorado outbreak. A reverse transcription-PCR (RT-PCR) assay was performed for toxin gene identification (19). From the Alabama outbreak, one patient had an isolate with the sed toxin gene, two patients had sec-positive isolates, and the remaining three patient isolates were negative for the toxin genes sea, seb, sec, sed, see, and seh. From the Colorado outbreak, one patient had an isolate containing the sea toxin gene, and the other patient was negative for toxin genes sea, seb, sec, sed, see, and seh. These isolate data suggest that while S. aureus could have been a coinfectant in a couple of the samples, it was not the common cause of these outbreaks.
DISCUSSION
Using a culture-independent shotgun metagenomic approach, we were able to resolve epidemiologically relevant phylogenetic signals from disease-state stool samples obtained from two Salmonella outbreaks and produced results consistent with those from isolate-based sequencing. Both approaches showed that the outbreaks were caused by different Salmonella Heidelberg genotypes that were distinct from the Salmonella Heidelberg isolates with the same PFGE pattern that were collected during the same year. Although the rapid onset of symptoms likely resulted from a higher pathogen load than typically found in disease cases, and hence more in-depth genetic information could be recovered about the pathogen, this study shows that it is feasible to recover sufficient variant data from complex metagenomic samples to obtain results comparable to those from cultured isolates (see, e.g., Fig. 1 and 2). The data also provided additional information that may be important for epidemiological investigations, such as signs of coinfection, gut microbiome shifts resulting from infection, as well as the estimated intrapopulation sequence and gene content diversity for the pathogen. Also important was the detection of abundant likely commensal populations (e.g., E. coli), which could confound traditional laboratory tests if the total gene complement of the population and composition of the microbial community are not known (a common limitation of the traditional laboratory tests; note also that if the virulence factors of the pathogen are not known, the metagenomic analysis will be obviously more limited than what was described here for E. coli and S. aureus). A similar finding was recently reported by Singh and colleagues, where a significant increase in levels of the genus Escherichia within intestinal communities affected by different enteric pathogens, including Campylobacter, Salmonella, Shigella, and Shiga toxin-producing E. coli (0.14% of total relative to 0.01% in uninfected communities), was noted (20). Our study provided additional support for those findings and further suggested that the E. coli population is most likely a commensal lineage that exhibited overgrowth in the intestine of the host during Salmonella infection.
Further, even though metagenomics indicated that coinfection with Staphylococcus might be possible for at least a couple of the patients, the low abundance and sporadic detection of Staphylococcus in our outbreak samples as a whole, and the high relative abundance and clonality of Salmonella within each outbreak, suggested that Salmonella was presumably the primary causative agent, and conditions caused by Salmonella infection may have facilitated the rapid growth of other commensal and potentially pathogenic organisms, such as E. coli and S. aureus. It is important to point out that this level of resolution of the gut microbial community is typically inaccessible by traditional culture-based isolation or PCR-based assays. If S. aureus was the common element in the outbreak, we would have detected, by culture or sequencing, the same S. aureus genotype(s), and in similar abundances, in the human subjects sampled (who all consumed the same food during the gatherings), which was not the case. Consistent with these interpretations, our culturing efforts and RT-PCR assay for the toxin showed that S. aureus (or E. coli) was likely not present in the consumed food.
Much attention has recently been given to the study of the human microbiome during disease, and there is growing evidence to suggest that host disease state is often accompanied by a deviation in microbial community composition. To date, the focus has remained primarily on chronic conditions, such as obesity, inflammatory bowel disease, and Crohn's disease, in which disease-state microbial communities exhibit reduced diversity and modularity and differ in their functional gene content, reflecting altered interaction with host metabolism (21, 22). Few studies have investigated the microbial community response to acute disease, such as enteric infection. An individual healthy gut microbiome remains relatively stable over time, and community shifts have been suggested as possible diagnostic markers (23). Because virulence factors influence disease manifestation, different pathogens may elicit characteristic responses in gut community composition. However, a recent comparison of stool samples from 200 patients surprisingly found no difference in gut microbiome composition associated with infection by four different enteric pathogens (20). Similarly, Pop et al. recently assessed fecal microbiota composition, based on 16S rRNA gene amplicon sequencing, in a cohort of 992 children under 5 years old who had been diagnosed with moderate to severe diarrhea; however, the exact causative agent(s) remained elusive for most of the cases examined (24). The results from the Pop et al. study showed that the main differences observed in the microbiota composition between diarrheal and normal stools are (rather minor) differences in the proportions of the most prevalent taxa, and that disease samples show, in general, slightly lower species diversity, a finding echoed by Singh et al. (20) and earlier reports (see, e.g., reference 25). 
In contrast to these previous studies, the microbial communities from two isolated foodborne outbreaks analyzed here were distinguishable from 300 healthy HMP samples primarily by their high abundance of members of the Enterobacteriaceae family (see, e.g., Fig. S4), and, to a lesser extent, members of the Bacteroidetes or Firmicutes, two dominant phyla found in the microbiomes of healthy individuals (56). These findings indicated that different types of diarrhea, e.g., acute versus moderate infection, may be characterized by different gut microbiota signatures, potentially opening new possibilities for the diagnosis and typing of diarrheal infections. Consistent with these interpretations, a recent longitudinal study of gut microbial community recovery observed that preexisting taxa dropped in abundance and new taxa appeared in great abundance following acute diarrheal disease due to Vibrio cholerae and enterotoxigenic Escherichia coli (ETEC) infection, before the preexisting taxa returned in abundance postinfection (26).
It is also possible that the differences noted between our and previous studies might be due, at least in part, to the methods employed by the different studies; e.g., the 16S rRNA gene used in previous studies typically offers limited resolution at the species and subspecies levels due to high sequence conservation (27), while the causative agent was unknown in most of the samples analyzed by Pop and colleagues, limiting further conclusions and interpretations. Alternatively, the differences in CO and AL patient microbial communities compared to healthy HMP samples may also reflect preexisting differences in composition rather than discrete responses to acute disease. Unfortunately, as is typically the case with samples gathered during foodborne infections by public health laboratories, healthy samples from the patients of the outbreak were not available to robustly test the hypothesis of discrete response to acute disease. However, given the unusually high Enterobacteriaceae signal in outbreak samples relative to healthy samples (see, e.g., Fig. S4) and the large sample size of healthy HMP samples included in our analysis, which covers a large portion of the U.S. population and different diets (16), the existence of pathogen-specific signatures in the gut microbiome may appear more likely than the alternative explanation based on preexisting differences in the composition of the microbiome. In any case, more samples and outbreaks, and with well-defined etiology, need to be analyzed before more robust conclusions on the existence of pathogen-specific signatures on the gut microbiome can emerge. Nonetheless, the results presented here created intriguing hypotheses about the effects of acute Salmonella infections on the composition of the gut microbiome.
In summary, our findings revealed the strengths of shotgun metagenomics in uncovering the total genome diversity in a sample and call for further detailed investigations of the agents present in stool and other human clinical specimens, including those associated with well-defined outbreaks. Even though the cost of shotgun metagenomics is currently prohibitive for everyday routine monitoring, the results presented here reveal that metagenomic investigations could provide high-accuracy culture-independent disease diagnosis and subtyping, as well as outbreak mapping in special cases. Shotgun metagenomics-based CIDTs were also faster than the isolate-based investigation in our case (<1 day versus 2 to 3 days required for isolate growth and characterization) and provided resolution of the pathogen and response of the gut microbial community that are not easily attainable by culture-based approaches. Our work also brought into sharper focus several aspects that could be improved in the future toward more robust and routine monitoring of foodborne outbreaks using metagenomics approaches. Most importantly, we learned that more samples from defined outbreaks, such as the ones presented here, need to be analyzed before more robust conclusions can emerge with respect to which types of diarrhea-causing organisms can be typed. This information would be highly relevant for diagnosing cases of unknown etiology as well. Further, there is still a need to determine the best laboratory procedures to obtain high-quality clinical specimens, to protect stool samples during shipment from temperatures that can degrade high-molecular-weight DNA and change the relative abundance of taxa, and to remove coeluting human DNA during sample preparation. While human DNA reads can be identified and removed in silico, as performed here, it is more effective to remove human DNA in vitro prior to the sequencing reaction. 
Therefore, the development of rapid effective human sequence removal chemistries will be fundamental to useful shotgun metagenomic sequencing in clinical and public health contexts.
MATERIALS AND METHODS
Stool sample collection.
Raw stool samples from human cases associated with the outbreaks were obtained from both Alabama Public Health Bureau of Clinical Laboratories (6 specimens) and the Colorado Department of Public Health and Environment (5 specimens). The specimens were anonymized according to institutional review board (IRB) regulations. The stool samples were stored at 4°C and shipped to the CDC at −20°C. Upon receipt, all samples were stored at −80°C.
Isolate genome DNA extraction and sequencing.
Salmonella isolates from the Alabama outbreak were isolated from stool specimens using standardized primary isolation procedures by the National Enteric Laboratory Diagnostics and Outbreak Team at CDC, while Salmonella isolates from the Colorado outbreak were isolated from selective media using standardized procedures by the Colorado Department of Public Health and Environment. DNA was extracted from isolated colonies using the QIAamp DNA minikit (Qiagen, Inc., Venlo, the Netherlands). Extracted isolate DNA was prepared for sequencing using the Illumina Nextera XT sample preparation kit (Illumina, Inc., San Diego, CA). Libraries prepared from each sample were then sequenced on an Illumina MiSeq instrument, using one MiSeq reagent version 2 kit (either 300 or 500 cycle), with up to 16 samples multiplexed per sequencing run.
Metagenomic DNA extraction and sequencing.
Raw stools were thawed at room temperature and homogenized using a BeadBeater (BioSpec Products, Inc., Bartlesville, OK) for 3 min. DNA was extracted from a homogenized stool mix using the QIAamp DNA stool minikit (Qiagen, Inc., Limburg, Germany) and evaluated for purity and concentration using a NanoDrop 2000 device (Thermo Scientific, Inc., Waltham, MA) and Agilent QuBit device (Life Technologies, Inc., Carlsbad, CA). Extracted whole stool DNA was then sheared to 450 bp using a Covaris M220 sonicator and prepared for sequencing using the SPRIworks high-throughput sample preparation kit (Beckman Coulter, Inc., Brea, CA) and the NEBNext Ultra DNA library prep kit (New England BioLabs, Inc., Ipswich, MA). Libraries prepared from each sample were then sequenced on an Illumina MiSeq instrument, using one 500-cycle MiSeq reagent version 2 kit for each sample library.
Variant analysis of isolate and metagenomically derived Salmonella genome sequence.
Raw FASTQ reads from both Salmonella isolates and shotgun-sequenced stool were cleaned and trimmed using SolexaQA (28) with default DynamicTrim settings for Illumina, and reads were retained when longer than 50 bp after trimming. Isolate reads were assembled using SPAdes 3.6.0 (29). Shotgun-sequenced stool reads were assembled using Omega (14). These metagenomically assembled contigs were filtered for those matching Salmonella by a BLAST search against a Salmonella genome recovered from an isolate from the outbreak and selecting those contigs with 95% or greater identity and 85% or greater percent coverage. Assembled isolate genomes and metagenomically assembled Salmonella contigs were then aligned along their core genomes using Parsnp (30). SNP distances identified using Parsnp were then converted into maximum-likelihood trees using RAxML under a general time-reversible model, optimization of substitution rates, and GAMMA model of rate heterogeneity with 1,000 iterations (31). Trees were visualized using FigTree (32).
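As an illustration of the contig-filtering step, the following minimal Python sketch applies the stated identity and coverage cutoffs to a list of alignment hits. It is not the code used in this study. The function name, the tuple layout of the hits, and the choice to compute coverage as the union of contig positions covered by qualifying hits are all assumptions made for the example.

```python
from collections import defaultdict


def salmonella_contigs(hits, contig_lengths, min_identity=95.0, min_coverage=85.0):
    """Return IDs of contigs that pass the identity and coverage cutoffs.

    hits: iterable of (contig_id, percent_identity, q_start, q_end) tuples,
          with 1-based inclusive coordinates on the contig (hypothetical layout).
    contig_lengths: dict mapping contig_id -> contig length in bp.
    """
    covered = defaultdict(set)
    for contig_id, identity, q_start, q_end in hits:
        if identity >= min_identity:
            lo, hi = sorted((q_start, q_end))
            # Assumption: coverage is the union of positions over qualifying hits.
            covered[contig_id].update(range(lo, hi + 1))
    return sorted(
        cid
        for cid, positions in covered.items()
        if 100.0 * len(positions) / contig_lengths[cid] >= min_coverage
    )
```

Storing every covered position in a set is simple but memory hungry; an interval merge would be a reasonable substitute for long contigs.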
Metagenomic assembly.
Raw reads were trimmed as described above for genome isolate reads. Thirteen healthy HMP metagenomic data sets were additionally included in this study for comparison purposes and processed in the same way for consistency. Reads identified as human were tagged and removed using BMTagger (33) with the reference human genome GRCh38 (NCBI accession no. GCA_000001405). For all metagenomes (both HMP and outbreak samples), IDBA-UD (13) was employed to assemble each sample with k-mer set to be 31, 33, 35, … 61. Contigs greater than 500 bp from all metagenomic data sets of a single outbreak were coassembled using Newbler 2.0, as previously described (34).
Population genome binning.
Reads were mapped to the contigs assembled from all metagenomic data sets of a single outbreak using Bowtie 2 version 2.2.0 (35), with default settings. The per-sample per-position coverage on coassembled contigs was counted based on the pileup file constructed by SAMtools version 0.1.19 (36) with the mpileup command using the ‘-C50 -d100000 -DSf’ options. The per-base coverage counting was translated into coassembled contig coverage vector over all samples from a particular outbreak location (AL or CO) using the following steps: (i) the 100 bp in either the 3′ or 5′ end was discarded; (ii) for the remaining sequence, 50-bp nonoverlapping windows were generated, and each window's coverage was calculated as the average coverage over the window length; and, finally, (iii) the contig's coverage was counted as the interquartile median of all the windows' coverage. Thus, a coverage matrix with rows for coassembled contigs longer than 1,000 bp and columns for samples was fed into MetaBAT (37) for binning, with default settings. The resulting bins were quality checked using CheckM (38) for genome percentage completeness and contamination.
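Steps (i) to (iii) above can be sketched for a single contig in a single sample as follows. This is a hedged illustration, not the pipeline code. It assumes that 100 bp are trimmed from both ends, that a trailing partial window is dropped, and that "interquartile median" means the median of the window values lying between the first and third quartiles; the function name and defaults are hypothetical.

```python
import statistics


def contig_coverage(per_base, trim=100, window=50):
    """Summarize the per-base coverage of one contig in one sample."""
    # (i) Discard `trim` bases at each end (assumed to mean both ends).
    core = per_base[trim:len(per_base) - trim]
    # (ii) Mean coverage of non-overlapping windows; partial window dropped.
    means = [
        sum(core[i:i + window]) / window
        for i in range(0, len(core) - window + 1, window)
    ]
    if not means:
        return 0.0  # contig too short to yield a full window
    if len(means) < 2:
        return means[0]
    # (iii) Median of the windows that fall within the interquartile range.
    q1, _, q3 = statistics.quantiles(means, n=4, method="inclusive")
    middle = [m for m in means if q1 <= m <= q3]
    # With very few windows no value may fall inside [q1, q3]; use all windows.
    return statistics.median(middle or means)
```

Calling this function once per contig and per sample yields the rows and columns of the coverage matrix described above.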
To identify E. coli and S. enterica contigs, 59 E. coli and 29 S. enterica complete genomes available from NCBI were used as a reference, and coassembled contigs were searched against these reference genomes using BLAT (39), with default settings. Contigs with best match to either E. coli or S. enterica with aligned length more than 70% of their total length, no less than 95% nucleotide identity, and with E value <1e−6 were selected. MetaBAT bins in which contigs passing these cutoffs for E. coli or S. enterica made up >85% of the total bin length were assigned to the corresponding species. To avoid missing flexible genomic parts specific to certain samples due to the consensus nature of coassembly, we also performed contig recruitment using the above-identified E. coli and S. enterica bins. A second round of BLAT search was carried out to link contigs from individual metagenomic data sets against the coassembled E. coli and S. enterica contigs. Contigs with ≥1,000 bp aligned, E value <1e−10, and nucleotide identity higher than 97% were selected, and the corresponding reads were extracted. We then reassembled those sample-specific E. coli or S. enterica contigs using SPAdes, with k values set to range from 31 to 65 with a step width of 5 (29).
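The per-contig cutoffs and the bin-level rule for the first BLAT round can be expressed compactly. The sketch below is illustrative only. The dictionary keys and function names are hypothetical, and it assumes that each contig carries at most one best-match species.

```python
def contig_matches(aligned_len, contig_len, identity, evalue):
    """Apply the per-contig cutoffs stated in the text."""
    return (
        aligned_len > 0.70 * contig_len  # aligned length more than 70% of contig
        and identity >= 95.0             # no less than 95% nucleotide identity
        and evalue < 1e-6                # E value below 1e-6
    )


def assign_bin(contigs):
    """Return the species a bin is assigned to, or None.

    contigs: list of dicts with hypothetical keys 'length' and 'species'
    (best-match species, or None when there is no hit) and, for contigs
    with a hit, 'aligned_len', 'identity', and 'evalue'.
    """
    total = sum(c["length"] for c in contigs)
    matched = {}
    for c in contigs:
        if c["species"] is not None and contig_matches(
            c["aligned_len"], c["length"], c["identity"], c["evalue"]
        ):
            matched[c["species"]] = matched.get(c["species"], 0) + c["length"]
    # A bin is assigned when matching contigs exceed 85% of its total length.
    for species, length in matched.items():
        if length > 0.85 * total:
            return species
    return None
```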
Microbial community composition and functional profiles.
To generate initial microbial community taxonomic profiles of the metagenomic samples, we used both a marker gene-based identification strategy using MetaPhlAn version 2.0 (40) and a 16S rRNA-based identification using Parallel-META (41). Metagenomic reads encoding 16S rRNA gene fragments were clustered into operational taxonomic units (OTUs) with QIIME (42), using the Uclust algorithm (43), and taxonomically assigned using the RDP Classifier (44). Resulting OTU counts at the genus level (or another taxonomic rank) were normalized using DESeq (45). Nonmetric multidimensional scaling (NMDS) plots were drawn from normalized OTU counts clustered at the genus level. We also performed a BLAST (46) recruitment to reference genome or contig sequences to determine how abundant individual species or strains are in the samples, as described previously (11, 47). Protein-coding genes were predicted on all contigs (both individual sample assemblies and coassembly) using MetaGeneMark (48). The identified protein-coding genes were searched against the UniProt Reference Clusters database UniRef90 (49), as well as the Virulence Factors of Pathogenic Bacteria database (http://www.mgc.ac.cn/VFs/) (50), using BLAT (39). Matches with more than 70% length aligned, E value of <1e−6, and identity of >70% were used for the functional annotation of genes.
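The count normalization step relies on the DESeq package (45). To clarify what that normalization does, the following Python sketch re-implements the median-of-ratios size-factor idea underlying DESeq in a simplified form. It is not the DESeq package, and the function names and data layout are assumptions for illustration.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of rows (one per taxon), each a list of per-sample counts.
    Rows containing a zero are skipped because their geometric mean is zero.
    Raises statistics.StatisticsError if every row contains a zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(statistics.median(r)) for r in log_ratios]


def normalize(counts):
    """Divide each count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]
```

In this scheme a sample sequenced twice as deeply as another receives a proportionally larger size factor, so that normalized counts of unchanged taxa become comparable across samples.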
Phylogenetic analysis of isolate and population bin genome sequences.
Representative reference strains of E. coli and Salmonella clades were chosen based on the phylogenetic analysis by Luo et al. (51) and Zhou et al. (52), respectively. The sample-specific bins with more than 1,000 protein-coding genes were included; for S. enterica analysis, the isolates from the outbreaks were also included. Orthologs were first identified using reciprocal best matches (RBMs) using BLASTN search with protocols detailed in reference 34. Orthologs belonging to the core genome were then aligned using MUSCLE version 3.8.51 (53), with default settings, and the concatenated alignments were trimmed by Gblocks version 0.91b (54) with default settings. The whole-genome-based phylogenetic tree was calculated based on the Gblocks alignment using FastTree version 2.1.7 (55) with the default model. To investigate the intraspecific diversity of E. coli and S. enterica populations in the metagenomes, ConStrains (15) analysis was carried out for all samples. The output of uniGcodes was used to identify core-genome SNPs.
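The reciprocal best match criterion used for ortholog identification can be sketched as follows. This is a minimal illustration rather than the protocol of reference 34. It assumes that hits are ranked by bit score and that any identity or alignment-length filters have already been applied; the function names and tuple layout are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Genes with an RBM in every genome under comparison would then form the core set that is aligned and concatenated.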
Accession number(s).
Data have been released in the Sequence Read Archive (SRA) database under BioProject 321753 with accession numbers SAMN05024035 through SAMN05024044.
ACKNOWLEDGMENTS
We acknowledge the contributions of scientists at the Alabama Department of Public Health, Bureau of Clinical Laboratories, as well as Joyce Knutsen, Mary Kate Cichon, Linda Brown, and Kristin Mayo from the Colorado Department of Public Health and Environment for collecting and assaying disease specimens. R. Chris Hopkins from Booz Allen Hamilton is acknowledged for technical support in the stool DNA extraction and next-generation sequencing (NGS) library preparation. We also thank the following individuals of the Enteric Diseases Laboratory Branch at the CDC: Nancy Garrett for performing the isolation of Salmonella from the Alabama outbreak samples; Gerardo Gómez for performing the food toxin, culture, and isolate testing for Staphylococcus aureus; and Heather Carleton-Romer and Ashley Sabol for performing the sequencing of Salmonella isolates. Heather is also acknowledged for helpful suggestions on the manuscript.
This work was supported by funds made available from the Centers for Disease Control and Prevention and in part by the U.S. National Science Foundation under award no. 1241046 (to K.T.K.). A.P.-G. was supported by Colciencias-Colombian Administrative Department for Science, Technology and Innovation through a doctoral fellowship.
Footnotes
Supplemental material for this article may be found at https://doi.org/10.1128/AEM.02577-16.
REFERENCES
- 1. Cronquist AB, Mody RK, Atkinson R, Besser J, Tobin D'Angelo M, Hurd S, Robinson T, Nicholson C, Mahon BE. 2012. Impacts of culture-independent diagnostic practices on public health surveillance for bacterial enteric pathogens. Clin Infect Dis 54(Suppl 5):S432–S439. doi: 10.1093/cid/cis267.
- 2. Amann RI, Ludwig W, Schleifer KH. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 59:143–169.
- 3. Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA. 2005. Diversity of the human intestinal microbial flora. Science 308:1635–1638. doi: 10.1126/science.1110591.
- 4. Hold GL, Pryde SE, Russell VJ, Furrie E, Flint HJ. 2002. Assessment of microbial diversity in human colonic samples by 16S rDNA sequence analysis. FEMS Microbiol Ecol 39:33–39. doi: 10.1111/j.1574-6941.2002.tb00904.x.
- 5. Scallan E, Hoekstra RM, Angulo FJ, Tauxe RV, Widdowson MA, Roy SL. 2011. Foodborne illness acquired in the United States—major pathogens. Emerg Infect Dis 17:7–15. doi: 10.3201/eid1701.P11101.
- 6. Acinas SG, Sarma-Rupavtarm R, Klepac-Ceraj V, Polz MF. 2005. PCR-induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample. Appl Environ Microbiol 71:8966–8969. doi: 10.1128/AEM.71.12.8966-8969.2005.
- 7. Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, Weir JC, Quince C, Smith GP, Betley JR, Aepfelbacher M, Pallen MJ. 2013. A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4. JAMA 309:1502–1510. doi: 10.1001/jama.2013.3231.
- 8. Mellmann A, Harmsen D, Cummings CA, Zentz EB, Leopold SR, Rico A, Prior K, Szczepanowski R, Ji Y, Zhang W, McLaughlin SF, Henkhaus JK, Leopold B, Bielaszewska M, Prager R, Brzoska PM, Moore RL, Guenther S, Rothberg JM, Karch H. 2011. Prospective genomic characterization of the German enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology. PLoS One 6:e22751. doi: 10.1371/journal.pone.0022751.
- 9. Tranter HS. 1990. Foodborne staphylococcal illness. Lancet 336:1044–1046. doi: 10.1016/0140-6736(90)92500-H.
- 10. Baird-Parker AC. 1990. Foodborne salmonellosis. Lancet 336:1231–1235. doi: 10.1016/0140-6736(90)92844-8.
- 11. Rodriguez RL, Konstantinidis KT. 2014. Estimating coverage in metagenomic data sets and why it matters. ISME J 8:2349–2351. doi: 10.1038/ismej.2014.76.
- 12. Rodriguez RL-M, Konstantinidis KT. 2014. Bypassing cultivation to identify bacterial species. Microbe Mag 9:111–118. doi: 10.1128/microbe.9.111.1.
- 13. Peng Y, Leung HC, Yiu SM, Chin FY. 2012. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28:1420–1428. doi: 10.1093/bioinformatics/bts174.
- 14. Haider B, Ahn T-H, Bushnell B, Chai J, Copeland A, Pan C. 2014. Omega: an overlap-graph de novo assembler for metagenomics. Bioinformatics 30:2717–2722. doi: 10.1093/bioinformatics/btu395.
- 15. Luo C, Knight R, Siljander H, Knip M, Xavier RJ, Gevers D. 2015. ConStrains identifies microbial strains in metagenomic datasets. Nat Biotechnol 33:1045–1052. doi: 10.1038/nbt.3319.
- 16. Human Microbiome Project Consortium. 2012. A framework for human microbiome research. Nature 486:215–221. doi: 10.1038/nature11209.
- 17. Pusz P, Bok E, Mazurek J, Stosik M, Baldy-Chudzik K. 2014. Type 1 fimbriae in commensal Escherichia coli derived from healthy humans. Acta Biochim Pol 61:389–392.
- 18. Bhavsar AP, Brown NF, Stoepel J, Wiermer M, Martin DD, Hsu KJ, Imami K, Ross CJ, Hayden MR, Foster LJ, Li X, Hieter P, Finlay BB. 2013. The Salmonella type III effector SspH2 specifically exploits the NLR co-chaperone activity of SGT1 to subvert immunity. PLoS Pathog 9:e1003518. doi: 10.1371/journal.ppat.1003518.
- 19. Moran GJ, Krishnadasan A, Gorwitz RJ, Fosheim GE, McDougal LK, Carey RB, Talan DA, EMERGEncy ID Net Study Group. 2006. Methicillin-resistant S. aureus infections among patients in the emergency department. N Engl J Med 355:666–674. doi: 10.1056/NEJMoa055356.
- 20. Singh P, Teal TK, Marsh TL, Tiedje JM, Mosci R, Jernigan K, Zell A, Newton DW, Salimnia H, Lephart P, Sundin D, Khalife W, Britton RA, Rudrik JT, Manning SD. 2015. Intestinal microbial communities associated with acute enteric infections and disease recovery. Microbiome 3:1–12. doi: 10.1186/s40168-014-0066-1.
- 21. Greenblum S, Turnbaugh PJ, Borenstein E. 2012. Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. Proc Natl Acad Sci U S A 109:594–599. doi: 10.1073/pnas.1116053109.
- 22. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI. 2009. A core gut microbiome in obese and lean twins. Nature 457:480–484. doi: 10.1038/nature07540.
- 23. Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, Waller A, Mende DR, Kultima JR, Martin J, Kota K, Sunyaev SR, Weinstock GM, Bork P. 2013. Genomic variation landscape of the human gut microbiome. Nature 493:45–50. doi: 10.1038/nature11711.
- 24. Pop M, Walker AW, Paulson J, Lindsay B, Antonio M, Hossain MA, Oundo J, Tamboura B, Mai V, Astrovskaya I, Corrada Bravo H, Rance R, Stares M, Levine MM, Panchalingam S, Kotloff K, Ikumapayi UN, Ebruke C, Adeyemi M, Ahmed D, Ahmed F, Alam MT, Amin R, Siddiqui S, Ochieng JB, Ouma E, Juma J, Mailu E, Omore R, Morris JG, Breiman RF, Saha D, Parkhill J, Nataro JP, Stine OC. 2014. Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition. Genome Biol 15:R76. doi: 10.1186/gb-2014-15-6-r76.
- 25. Carroll IM, Ringel-Kulka T, Siddle JP, Ringel Y. 2012. Alterations in composition and diversity of the intestinal microbiota in patients with diarrhea-predominant irritable bowel syndrome. Neurogastroenterol Motil 24:521–530, e248. doi: 10.1111/j.1365-2982.2012.01891.x.
- 26. David LA, Weil A, Ryan ET, Calderwood SB, Harris JB, Chowdhury F, Begum Y, Qadri F, LaRocque RC, Turnbaugh PJ. 2015. Gut microbial succession follows acute secretory diarrhea in humans. mBio 6:e00381-15. doi: 10.1128/mBio.00381-15.
- 27. Konstantinidis KT, Tiedje JM. 2007. Prokaryotic taxonomy and phylogeny in the genomic era: advancements and challenges ahead. Curr Opin Microbiol 10:504–509. doi: 10.1016/j.mib.2007.08.006.
- 28. Cox MP, Peterson DA, Biggs PJ. 2010. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11:485. doi: 10.1186/1471-2105-11-485.
- 29. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477. doi: 10.1089/cmb.2012.0021.
- 30. Treangen TJ, Ondov BD, Koren S, Phillippy AM. 2014. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol 15:524. doi: 10.1186/s13059-014-0524-x.
- 31. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313. doi: 10.1093/bioinformatics/btu033.
- 32. Rambaut A. 2011. FigTree. Molecular evolution, phylogenetics and epidemiology. http://tree.bio.ed.ac.uk/software/figtree/.
- 33. Rotmistrovsky K, Agarwala R. 2011. BMTagger: Best Match Tagger for removing human reads from metagenomics datasets. ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/bmtagger/.
- 34. Luo C, Tsementzi D, Kyrpides NC, Konstantinidis KT. 2012. Individual genome assembly from complex community short-read metagenomic datasets. ISME J 6:898–901. doi: 10.1038/ismej.2011.147.
- 35. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359. doi: 10.1038/nmeth.1923.
- 36. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 37. Kang DD, Froula J, Egan R, Wang Z. 2015. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3:e1165. doi: 10.7717/peerj.1165.
- 38. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25:1043–1055. doi: 10.1101/gr.186072.114.
- 39. Kent WJ. 2002. BLAT—the BLAST-like alignment tool. Genome Res 12:656–664. doi: 10.1101/gr.229202.
- 40. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. 2012. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods 9:811–814. doi: 10.1038/nmeth.2066.
- 41. Su X, Pan W, Song B, Xu J, Ning K. 2014. Parallel-META 2.0: enhanced metagenomic data analysis with functional annotation, high performance computing and advanced visualization. PLoS One 9:e89323. doi: 10.1371/journal.pone.0089323.
- 42. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R. 2010. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7:335–336. doi: 10.1038/nmeth.f.303.
- 43. Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460–2461. doi: 10.1093/bioinformatics/btq461.
- 44. Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73:5261–5267. doi: 10.1128/AEM.00062-07.
- 45. Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Genome Biol 11:R106. doi: 10.1186/gb-2010-11-10-r106.
- 46. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 47. Konstantinidis KT, DeLong EF. 2008. Genomic patterns of recombination, clonal divergence and environment in marine microbial populations. ISME J 2:1052–1065. doi: 10.1038/ismej.2008.62.
- 48. Borodovsky M, Lomsadze A. 2014. Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr Protoc Bioinformatics Chapter 4:Unit 4.5-1–4.5-17.
- 49. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. 2007. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23:1282–1288. doi: 10.1093/bioinformatics/btm098.
- 50. Chen L, Yang J, Yu J, Yao Z, Sun L, Shen Y, Jin Q. 2005. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res 33:D325–D328.
- 51. Luo C, Walk ST, Gordon DM, Feldgarden M, Tiedje JM, Konstantinidis KT. 2011. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc Natl Acad Sci U S A 108:7200–7205. doi: 10.1073/pnas.1015622108.
- 52. Zhou Z, McCann A, Weill FX, Blin C, Nair S, Wain J, Dougan G, Achtman M. 2014. Transient Darwinian selection in Salmonella enterica serovar Paratyphi A during 450 years of global spread of enteric fever. Proc Natl Acad Sci U S A 111:12199–12204. doi: 10.1073/pnas.1411012111.
- 53. Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797. doi: 10.1093/nar/gkh340.
- 54. Talavera G, Castresana J. 2007. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564–577. doi: 10.1080/10635150701472164.
- 55. Price MN, Dehal PS, Arkin AP. 2010. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS One 5:e9490. doi: 10.1371/journal.pone.0009490.
- 56. Human Microbiome Project Consortium. 2012. Structure, function and diversity of the healthy human microbiome. Nature 486:207–214. doi: 10.1038/nature11234.