Abstract
Just as the expansion in genome sequencing has revealed and permitted the exploitation of phylogenetic signals embedded in bacterial genomes, the application of metagenomics has begun to provide similar insights at the ecosystem level for microbial communities. However, little is known regarding this aspect of bacteriophage associated with microbial ecosystems, or whether phage encode discernible habitat-associated signals diagnostic of underlying microbiomes. Here we demonstrate that individual phage can encode clear habitat-related “ecogenomic signatures”, based on relative representation of phage-encoded gene homologues in metagenomic datasets. Furthermore, we show the ecogenomic signature encoded by the gut-associated ɸB124-14 can be used to segregate metagenomes according to environmental origin, and distinguish “contaminated” environmental metagenomes (subject to simulated in silico human faecal pollution) from uncontaminated datasets. This indicates phage-encoded ecological signals likely possess sufficient discriminatory power for use in biotechnological applications, such as development of microbial source tracking tools for monitoring water quality.
Keywords: Human gut microbiome, metagenomics, faecal pollution, phage ecology
Introduction
The faecal contamination of environmental waters used for drinking and recreational purposes poses a major potential risk to public health. Detection of faecal contamination and determination of its origin (microbial source tracking; MST) is an emerging element in managing these risks and safeguarding water quality. At present, the cultivation of faecal indicator bacteria (FIB) from water samples, such as faecal coliforms, Escherichia coli, and Enterococcus spp., remains a mainstay of methods for detecting faecal pollution of water resources (Harwood et al., 2014; Leclerc et al., 2001; Griffith et al., 2009; Ahmed et al., 2016). Although the detection and enumeration of FIB have long been useful in strategies to improve and maintain water quality, they are subject to a range of limitations that impair their overall utility. Limitations include their lack of specificity to human faeces, poor persistence or potential regrowth in certain environments, and long turnaround times associated with culture-based detection (Haack et al., 2003; McLellan and Salmore, 2003; Whitman et al., 2003; Ishii et al., 2006).
Consequently, numerous alternative human-specific MST approaches have been developed in recent years, including both culture-dependent and molecular-based approaches. Culture-independent, molecular-based approaches to MST are increasingly attractive as they offer the potential to overcome certain limitations inherent in culture-dependent approaches. These include a reduced turnaround time and improved sensitivity, which should lead to more efficient quantification and prediction of risk. Ultimately, molecular-based MST approaches could conceivably deliver an indication of water quality directly at the point of sample collection, and in near real-time (Tan et al., 2015).
To date, the development of molecular-based MST methods has focused primarily on the detection and amplification of target genes or sequences associated with specific faecal bacteria, typically using either end-point or quantitative PCR (Harwood et al., 2014; Gómez-Doñate et al., 2016). More recently, improved access to high-throughput next generation sequencing technologies, along with the growing portability, ease-of-use, and affordability of such systems, has begun to offer the prospect of developing metagenomic approaches to MST (e.g. Knights et al., 2011; Tan et al., 2015). The application of metagenomics to MST should permit high-resolution methods based on surveillance of whole microbial communities, and identification of habitat-specific genetic patterns that can distinguish microbial ecosystems (also termed “ecogenomic signatures”).
Alternatives to FIB are also likely to be important in the development of more effective MST tools. In particular, human gut-specific bacteriophage (phage) that infect anaerobic gut bacteria are increasingly viewed as potentially superior indicators of pollution compared to direct detection of their bacterial hosts. The advantages of phage for MST are a longer environmental persistence, greater abundance than the host bacteria, and the ability of phage to replicate within cultured host species, all of which can serve to amplify any signal of human faecal contamination and improve sensitivity (Gómez-Doñate et al., 2011; Lee et al., 2009). These potential advantages of phage in MST are further supported by reports of the isolation and characterisation of apparently human gut-specific phage, and the subsequent use of these as MST tools (Payan et al., 2005; Gómez-Doñate et al., 2011; Lee et al., 2009; Jofre et al., 2014; Harwood et al., 2013; McMinn et al., 2014).
Furthermore, many of the advantages offered by phage in traditional culture-based MST methods (Jofre et al., 2014) would also seem to apply to the development of phage-based culture-independent approaches. These include metagenomic MST tools, which could conceivably target the entire retinue of viruses associated with a particular microbial ecosystem (the virome). However, the potential for such virome-based metagenomic MST is currently uncertain, and first requires fundamental study in order to define the principles under which phage-based metagenomic MST could operate. In particular, it remains unclear to what extent individual phage, or wider phage communities, associated with target ecosystems are diagnostic of underlying host microbiomes and contain unambiguous ecogenomic signals which offer sufficient discriminatory power for MST.
Here we hypothesise that individual human-gut associated phage, infecting key members of this microbiome, will encode a distinct habitat-associated signal derived from the co-evolution and adaptation of phage and host to life within the human gut. If so, homologues of genes encoded by such phage should display an increased relative abundance in human gut-derived metagenomes, compared to metagenomes from other microbial ecosystems. To test this hypothesis, we utilised publicly available viral and whole community metagenomic datasets to develop a comprehensive ecological profile of ɸB124-14, a phage previously proven to infect a restricted set of human-associated Bacteroides fragilis strains, including those with MST utility (Ogilvie et al., 2012; Ebdon et al., 2007), and compared this to phage from non-gut habitats. Our previous genetic and ecological profiling of ɸB124-14 indicated that this phage has utility as a marker of human faecal pollution, with potential as a platform for the development of quantitative molecular MST tools (Ogilvie et al., 2012). As such, ɸB124-14 constitutes an excellent model with which to begin to explore the existence of habitat-specific ecogenomic signatures in phage genomes and their application to development of improved MST approaches.
Results
Representation of sequences with similarity to bacteriophage encoded open reading frames (ORFs) in viral metagenomes
To evaluate the relative representation of genes with similarity to those encoded by ɸB124-14 in viral metagenomes, we calculated the cumulative relative abundance of sequences similar to translated ɸB124-14 open reading frames (ORFs) in each metagenome (Figure 1). These datasets encompassed the human, porcine and bovine gut, as well as a broad range of aquatic environmental habitats (see Supplementary Table S1). Sequences generating valid hits to at least one ɸB124-14 ORF were identified in all datasets evaluated, but a significantly greater mean relative abundance of ɸB124-14 encoded ORFs was evident in human gut viromes, compared with environmental datasets (Figure 1a). No significant differences were apparent between the mean cumulative relative abundance of ɸB124-14 ORFs in human gut viromes and in the other gut viromes examined (Figure 1a). Individual human gut viromes were also observed to display a notably greater variation in ɸB124-14 cumulative relative abundance than other datasets analysed (Figure 1a).
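The normalisation behind a cumulative relative abundance value can be illustrated with a minimal sketch. It assumes the hits/Mb unit used for the whole community datasets (valid hits per ORF divided by dataset size in megabases, then summed over ORFs). The function name, ORF labels, hit counts, and dataset size are hypothetical and are not taken from the datasets analysed here.

```python
def cumulative_relative_abundance(orf_hits, dataset_size_bp):
    """Return (per-ORF hits/Mb, cumulative hits/Mb) for one metagenome.

    orf_hits: dict mapping an ORF label to its number of valid hits.
    dataset_size_bp: total size of the metagenome in base pairs.
    """
    if dataset_size_bp <= 0:
        raise ValueError("dataset size must be positive")
    size_mb = dataset_size_bp / 1_000_000
    # Normalise each ORF's hit count by dataset size so that
    # metagenomes of different sizes can be compared directly.
    per_orf = {orf: hits / size_mb for orf, hits in orf_hits.items()}
    return per_orf, sum(per_orf.values())


# Hypothetical hit counts for three ORFs in a 150 Mb virome.
example_hits = {"ORF16": 30, "ORF34": 12, "ORF56": 3}
per_orf, cumulative = cumulative_relative_abundance(example_hits, 150_000_000)
print(per_orf)
print(round(cumulative, 3))
```

The per-ORF values form the relative abundance profile of a dataset, and their sum gives the single cumulative value compared between habitats.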
Figure 1. Cumulative relative abundance of sequences with similarity to open reading frames encoded by Bacteroides ɸB124-14, Cyanophage SYN5, and Burkholderia phage KS10 in viral metagenomes.
Reads from each virome were mapped to translated ɸB124-14, ɸSYN5 or ɸKS10 ORFs using BlastX. Details of datasets used are provided in Supplementary Table S1. (a-c) Relative representation of phage ORFs across habitats represented by viromes. Charts show cumulative relative abundance of sequences with homology to ORFs encoded by Bacteroides ɸB124-14, Cyanobacteria ɸSYN5 and Burkholderia ɸKS10. For environmental datasets, those derived from temperate marine environments most relevant to the predicted ɸSYN5 host habitat were also analysed as a distinct subgroup. (d-g) Comparison of phage representation within specific habitats. Charts show cumulative relative abundance of sequences with homology to ORFs from each phage examined in viral metagenomes from the human gut, porcine gut, bovine gut and the environment. In all figures bars show mean plus SEM, and statistically significant differences are denoted by * P ≤ 0.05, ** P ≤ 0.01, **** P ≤ 0.0001, vs Human Gut Viromes (a-c) or ɸB124-14 (d-g).
To determine if these “gut-associated” ɸB124-14 relative abundance profiles represented a habitat-related signal in ɸB124-14, or could be attributed to properties of phage genomes or the human gut virome in general, we repeated this experiment using additional genomes from phage not considered to be associated with the human gut. These included the Cyanophage SYN5 (Pope et al., 2007), and the Burkholderia prophage KS10 (Goudie et al., 2008). ɸSYN5 was isolated from temperate marine environments, while ɸKS10 was identified in B. cenocepacia strain K56-2, an organism typically associated with the plant rhizosphere, but also an opportunistic human pathogen (Seed and Dennis, 2005). Based on tetranucleotide profiling, ɸKS10 has previously been shown to be among the most distantly related phage to ɸB124-14 (Ogilvie et al., 2012).
Neither ɸSYN5 nor ɸKS10 exhibited the gut-associated enrichment of similar ORFs evident for ɸB124-14, when cumulative relative abundance profiles of each phage were considered across all habitats represented (Figure 1b,c). However, ɸSYN5 displayed a significantly greater representation in a subset of datasets from marine environments relative to gut viromes, congruent with its environmental origin, and indicative of an ecological profile distinct from ɸB124-14 (Figure 1b). In contrast, sequences similar to ɸKS10 ORFs appeared to be only very poorly represented in the majority of datasets examined, with no discernible ecogenomic profile identified within the datasets analysed (Figure 1c). Comparison of phage-to-phage relative abundance profiles within specific habitats reinforced the potential for a gut-associated ecogenomic signal in ɸB124-14, with ɸSYN5 and ɸKS10 shown to have significantly lower representation in all gut-derived viromes examined (Figure 1d-g).
Detection of the ɸB124-14 ecogenomic signal in whole community metagenomes
Because the human gut virome is believed to be dominated by temperate phage (Reyes et al., 2010; Minot et al., 2011), and we have previously demonstrated that conventional whole community shotgun metagenomes derived from human gut bacteria capture notable fractions of the gut-associated Bacteroides phage population (Ogilvie et al., 2013), we next explored the representation of ɸB124-14 ORFs in assembled whole community metagenomes. These encompassed datasets derived from the human gut and other body sites, as well as a range of non-human gut and environmental habitats (Supplementary Table S1).
Analysis of the cumulative relative abundance of sequences with similarity to ɸB124-14 ORFs across habitats showed no significant differences between whole community human gut metagenomes and non-human gut or environmental datasets (Figure 2a). However, a significantly decreased representation at other human body sites compared to the human gut was detected (Figure 2a). Identical analyses using ɸSYN5 showed that, compared to human gut datasets, ɸSYN5 ORFs had significantly greater representation in environmental datasets, congruent with the environmental origin of this phage (Figure 2b). ɸKS10 again showed no discernible ecological profile within these datasets (Figure 2c).
Figure 2. Cumulative relative abundance of sequences with similarity to open reading frames encoded by ɸB124-14, Cyanophage SYN5, and Burkholderia phage KS10 in assembled whole community metagenomes.
Datasets were searched with translated ɸB124-14, ɸSYN5, or ɸKS10 ORF sequences using tBlastn. Valid hits were used to calculate the cumulative relative abundance of sequences with similarity to phage ORFs in each dataset (expressed as Hits/Mb). (a-c) Relative representation of phage ORFs across habitats represented by whole community metagenomes. Charts show cumulative relative abundance of sequences with similarity to ORFs encoded by Bacteroides ɸB124-14, Cyanobacteria ɸSYN5, and Burkholderia ɸKS10. (d-h) Comparison of phage representation within specific habitats. Charts show cumulative relative abundance of sequences with similarity to ORFs from each phage examined in whole community metagenomes from the human gut, human oral cavity (mouth and throat), other human body sites (skin, nares, vagina), non-human gut, and wider environment. For all datasets, bars show mean plus SEM. *** P < 0.001, **** P < 0.0001 vs Environmental Viromes (a-c) or ɸB124-14 (d-h).
When phage relative abundance profiles were compared directly within specific habitats on a phage-to-phage basis, a significantly greater representation of sequences with similarity to ɸB124-14 ORFs was apparent in human-derived datasets in general, compared with ɸSYN5 or ɸKS10 (Figure 2d-f). ɸB124-14 ORFs also showed significantly greater representation in non-human gut metagenomes compared to ɸSYN5 (Figure 2g), but no significant differences were noted between phage when environmental metagenomes were examined (Figure 2h).
The ɸB124-14 ecogenomic signal can discriminate human gut viromes from other datasets
Given the observed enrichment of sequences with similarity to ɸB124-14 ORFs in mammalian gut-derived viral metagenomes, and other human-derived whole community metagenomes, we next examined the potential for this putative ecogenomic profile to distinguish human gut metagenomes from those derived from other habitats. We reasoned that a genuine habitat-related ecogenomic signature should permit the accurate segregation and grouping of metagenomic datasets based on their environmental origin. To test this, non-metric multidimensional scaling (nMDS) was used for unsupervised ordination of individual metagenomes, based on relative abundance profiles of ɸB124-14 ORFs in each dataset. The level and significance of separation between groups of metagenomes was subsequently investigated using Analysis of Similarities (ANOSIM) (Clarke, 1993). To increase stringency, only metagenomes with representation of at least 2 distinct phage ORFs were included in this analysis.
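For readers unfamiliar with ANOSIM, the R statistic can be sketched in a few lines. The example below is an illustrative sketch only: it assumes Bray-Curtis dissimilarity between ORF relative abundance profiles (the dissimilarity measure is an assumption made here), all profile values are hypothetical, and the permutation of group labels normally used to assess significance is omitted for brevity.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two non-negative profiles."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def anosim_r(profiles, groups):
    """ANOSIM R: (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    pairs = list(combinations(range(len(profiles)), 2))
    ranks = average_ranks([bray_curtis(profiles[i], profiles[j]) for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    n = len(profiles)
    mean_between = sum(between) / len(between)
    mean_within = sum(within) / len(within)
    return (mean_between - mean_within) / (n * (n - 1) / 4)


# Hypothetical hits/Mb profiles (three ORFs) for two gut and two environmental viromes.
profiles = [[10.0, 8.0, 6.0], [9.0, 9.0, 5.0], [0.1, 0.2, 0.0], [0.2, 0.1, 0.1]]
groups = ["gut", "gut", "env", "env"]
r_stat = anosim_r(profiles, groups)
print(r_stat)
```

In this toy case every between-group dissimilarity exceeds every within-group dissimilarity, so R reaches its maximum of 1, corresponding to total separation of the groups.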
Ordination of all available datasets based on the ɸB124-14 relative abundance profile generated a clear overall separation between viral metagenomes and those derived from whole communities (Figure 3a,c; Supplementary Figure S1; Supplementary Table S2). Ordination of assembled human gut viromes indicated that assembly of datasets had only minimal impact on nMDS distributions. These datasets displayed lower overall relative abundance values than unassembled counterparts, but collectively remained closely associated with unassembled datasets, and strongly separated from whole community metagenomes (Figure 3a,c; Supplementary Figure S1). When the relationship between viral datasets was examined in more detail, human gut viromes were observed to exhibit a clear and significant separation from other viral datasets (bovine, porcine and environmental) based on the ɸB124-14 relative abundance profile (Figure 3b,c; Supplementary Figure S1).
Figure 3. Unsupervised ordination of metagenomic datasets based on phage ecogenomic signatures.
Non-metric multidimensional scaling (nMDS) was used to ordinate individual metagenomic datasets based on the relative abundance profiles of individual ORFs from either ɸB124-14 or ɸSYN5. The strength and significance of separation between groups of metagenomes with related environmental origins was evaluated using Analysis of Similarities (ANOSIM). To reduce noise and increase stringency, only metagenomes with representation of 2 or more distinct phage ORFs were included in this analysis. (a,b,d,e) nMDS ordination of all metagenomes (all datasets), or exclusively viral metagenomes (viromes only), based on ɸB124-14 or ɸSYN5 ORF relative abundance profiles. Filled ellipses show standard deviation of dispersion of each group relative to the group centroid. For nMDS based on ɸSYN5 relative abundance profiles no datasets from Human Gut Virome assemblies, Human Oral cavity or Human Body sites met the minimum criteria for inclusion. (c,f) ANOSIM analysis of differences between groups of metagenomes used in nMDS. Charts show the ANOSIM R statistic for each comparison relative to the unassembled human gut viral datasets. An increasing strength of separation between groups is indicated as the R statistic approaches 1 (total separation). Symbols above bars indicate statistical significance of observed separation between groups: ** P ≤ 0.001; * P ≤ 0.05. For ɸSYN5 analyses, groups where no datasets met the threshold criteria for representation of a minimum of 2 distinct ORFs were not included in nMDS or ANOSIM, and are indicated as “Failed Detection Threshold” in Part f.
Human Gut Viromes, Bovine Viromes, Porcine Viromes, Env Viromes – unassembled viral metagenomes derived respectively from the human, bovine, and porcine gut, or of non-host associated environmental origin; Human Gut Viromes (assem) – Assemblies of Human Gut Viral datasets; Human Gut Whole, NH Gut – Whole community datasets derived from human or non-human gut respectively; Body, Oral – Whole community metagenomes from various human body sites or the oral cavity respectively; Env Whole – Whole community metagenomes of non-host associated environmental origin. Details of datasets in each group are provided in Supplementary Table S1.
In contrast, ɸSYN5 ORF relative abundance profiles provided considerably poorer resolution of metagenome groups, and reduced the number of metagenome groups meeting minimum criteria for inclusion in this analysis (Figure 3d-f; Supplementary Figure S1). Use of the ɸSYN5 ecogenomic profile resulted in more highly dispersed groups, with less separation of viral datasets from each other, and from the whole community environmental metagenome group (Figure 3d-f). A notable exception was an apparently enhanced ability to distinguish porcine and human gut-derived metagenomes with the ɸSYN5 profile (Figure 3e,f). A comparable analysis using ɸKS10 was not possible due to the very low representation of sequences with homology to ɸKS10 ORFs in the majority of datasets.
Use of the ɸB124-14 ecogenomic signature to identify human-associated pollution in environmental datasets
To evaluate the potential of the ɸB124-14 ecogenomic signature to identify the presence of human gut-associated pollution in environmental samples, we simulated the contamination of environmental viromes with human gut virome content. This was performed by adding the average human gut-derived relative abundance profile of ɸB124-14, to profiles obtained from environmental viral datasets. ɸB124-14 gut-associated profiles were added to environmental profiles at “strengths” ranging from 100-0.01%, to explore the range over which the ɸB124-14 gut-associated ecogenomic signal may be detectable when combined with background environmental signals.
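The simulated contamination described above amounts to adding a scaled copy of the mean human gut profile to each environmental profile. The following minimal sketch illustrates this under stated assumptions: the function names and all profile values are hypothetical, and the individual dilution steps are assumed, since only the overall range of 100% to 0.01% is specified.

```python
def mean_profile(profiles):
    """Element-wise mean of several ORF relative abundance profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def contaminate(env_profile, gut_mean, strength_pct):
    """Add the mean gut profile, scaled to strength_pct percent, to an environmental profile."""
    scale = strength_pct / 100
    return [e + scale * g for e, g in zip(env_profile, gut_mean)]


# Assumed dilution series covering the stated 100% to 0.01% range.
STRENGTHS_PCT = [100, 10, 1, 0.1, 0.01]

# Hypothetical hits/Mb profiles over three ORFs.
gut_profiles = [[4.0, 2.0, 0.0], [6.0, 2.0, 2.0]]
env_profile = [0.5, 0.0, 0.1]

gut_mean = mean_profile(gut_profiles)
polluted = {s: contaminate(env_profile, gut_mean, s) for s in STRENGTHS_PCT}
for strength, profile in polluted.items():
    print(strength, [round(v, 4) for v in profile])
```

Each resulting “polluted” profile can then be ordinated alongside the uncontaminated datasets in the same way as any other metagenome.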
This showed a correlation between dilution of the ɸB124-14 human gut-associated ecogenomic signal, and separation of “contaminated” datasets from human gut or “uncontaminated” environmental viromes (Figure 4a,b). As the ɸB124-14 ecogenomic signal strength decreased, contaminated datasets exhibited correspondingly increased separation from human gut viromes by nMDS and ANOSIM, and a closer association with uncontaminated environmental metagenomes (Figure 4a,b). In addition, it is notable that contamination of environmental datasets with the human gut-derived ɸB124-14 ecogenomic signature also provided a clear indication of human gut-associated pollution specifically, and these datasets remained distinct and well-separated from bovine and porcine viromes (Figure 4a,b). In contrast, the same experiment using the ɸSYN5 human-gut derived relative abundance profile provided no discernible separation of contaminated environmental datasets from uncontaminated viromes, in keeping with the alternative environmental ecogenomic signature exhibited by this phage, and reinforcing the gut-specific nature of the ɸB124-14 relative abundance profile across these datasets (Figure 4c,d).
Figure 4. Detection of human gut associated ecogenomic signals in simulated “polluted” environmental datasets.
The potential for the ɸB124-14 ecogenomic signal to identify human faecal pollution in environmental datasets was explored by simulating pollution of selected environmental viromes. This was achieved by combining average human gut virome ɸB124-14 or ɸSYN5 relative abundance profiles with those of selected environmental viromes. Human gut associated profiles were combined at “strengths” ranging from 100% to 0.01% of human gut virome average, with profiles of viromes from the Bay of British Columbia, Sargasso Sea, Gulf of Mexico, Tampa Bay, and Reclaimed Water. Relationships between groups of “uncontaminated” and “polluted” metagenomes were explored using non-metric multi-dimensional scaling (nMDS) and analysis of similarities (ANOSIM) as for Figure 3. (a,c) nMDS ordination of uncontaminated metagenomes and those modified to include either ɸB124-14 or ɸSYN5 human gut virome profiles. Filled ellipses show standard deviation of dispersion of each group relative to the group centroid. Black ellipse denotes groups of “polluted” environmental datasets, with “strength” (100-0.01%) of human gut signal added. (b,d) ANOSIM analysis of the differences between groups of metagenomes used in nMDS ordination. Charts show the ANOSIM R statistic for each uncontaminated group of metagenomes compared with datasets modified to simulate different levels of human faecal pollution. An increasing strength of separation between groups is indicated as the R statistic approaches 1 (total separation). Open symbols indicate no significant separation from the polluted dataset compared, while closed symbols indicate significant separation (P ≤ 0.05). Human Gut Viromes, Bovine Viromes, Porcine Viromes, Env Viromes – unassembled viral metagenomes derived respectively from the human, bovine, and porcine gut, or of non-host associated environmental origin; Env Whole – Whole community metagenomes of non-host associated environmental origin.
Details of datasets in each group are provided in Supplementary Table S1.
Identification of human gut-associated genes in the ɸB124-14 genome
To further delineate the human gut-associated ecogenomic signal inherent in ɸB124-14, and to identify genome regions with the strongest gut affiliation, we next explored the representation of individual ɸB124-14 ORFs in all metagenomes in more detail. This revealed that a subset of ɸB124-14 ORFs appear to exhibit a highly cosmopolitan distribution across ecosystems, with similar sequences in >50% of all datasets examined and representation in almost every habitat examined (Figure 5a,b; Supplementary Table S3). These cosmopolitan ORFs are distributed throughout the ɸB124-14 genome and encode diverse functions including DNA recombination and repair, thymidylate synthase activity, peptidase activity, and a phage anti-repressor, as well as ORFs of unknown function (Figure 5a,b; Supplementary Table S3). The other phage genomes examined also contained examples of cosmopolitan ORFs, which were predicted to encode functions similar to counterparts in ɸB124-14 (Supplementary Figures S2 & S3; Supplementary Table S3).
Figure 5. Identification of human gut-associated genes in the ɸB124-14 genome.
The representation of each ɸB124-14 ORF in all datasets was used to assess the consistency of the human-gut associated ecogenomic signal across the phage genome, and identify ORFs with human gut affiliations. (a) Average relative abundance (hits/Mb), and representation of ɸB124-14 ORFs across all 840 datasets examined. Colours of bars indicate the % of datasets with at least one valid hit to each ORF as described in the associated legend. Significant differences in average relative abundance for ORFs represented in 50% or more of the datasets examined are shown by symbols above bars, and colours indicate significance vs all other ɸB124-14 ORFs, or significance vs all other ɸB124-14 ORFs with less than 50% representation in datasets examined. Bars show SEM. (b) Heatmap showing relative abundance of individual ɸB124-14 ORFs in each metagenomic dataset examined. Columns represent ORFs as indicated on Part (a) x-axis, and rows represent metagenomic datasets. The intensity of shading of each cell represents the relative abundance (hits/Mb) of each ORF in each particular metagenome, corresponding to the scale provided. (c) Relative representation of ɸB124-14 ORFs in human gut-derived viral datasets compared to other viromes. Points show the average relative abundance of each ORF in viral metagenomes from each category, expressed as Log10 hits/Mb. Membership of each ORF with previously described functional gene clusters in the ɸB124-14 genome (Ogilvie et al., 2012) is indicated below the x-axis. Symbols above points indicate significantly greater relative abundance in human gut viromes compared with either all other viromes, or compared with those of environmental origin. * P < 0.05; ** P < 0.01; *** P < 0.001; **** P < 0.0001. Details of datasets in each group are provided in Supplementary Table S1.
This analysis also revealed a range of ORFs in the ɸB124-14 genome with a seemingly clear-cut human gut affiliation (Figure 5b,c; Supplementary Table S4). These ORFs were relatively well represented in human gut viromes and human gut whole community datasets, as well as other mammalian gut viromes, but overall poorly represented in datasets from other habitats (Figure 5b). These gut-associated ORFs were distributed throughout the ɸB124-14 genome, with a notable concentration in regions of the genome predicted to be involved in synthesis of the viral capsid and genome packaging (Figure 5c; Ogilvie et al., 2012). When the representation of these gut-affiliated ɸB124-14 genomic regions was considered in viral datasets specifically, many were found to exhibit a significant enrichment in human gut viromes compared to environmental viromes, or in some cases all other viral datasets (Figure 5c). In accordance with the other analyses conducted, no comparable human gut-associated pattern was observed for ɸSYN5 and ɸKS10 genomes, but ɸSYN5 ORFs were observed to be well represented in environmental datasets relative to other metagenomes examined (Supplementary Figures S2 & S3).
Simulation and modelling of virome-based source tracking using ɸB124-14 ecogenomic signatures
To further probe the robustness of this habitat related signal, and begin to provide insight into the potential sensitivity, specificity, and accuracy of virome-based MST tools, we next simulated a more expansive and varied set of environmental viromes. This was achieved through random permutation of ecogenomic profiles derived from environmental datasets, followed by introduction of random levels of human, bovine or porcine pollution (based on addition of respective ɸB124-14 ecogenomic profiles). Ordination of these permuted and polluted datasets by nMDS indicated that the ɸB124-14 ecogenomic signal was still able to clearly segregate all groups of data, and in proportion to the strength of human, bovine, or porcine signal applied (Figure 6a). Datasets with lower levels of human and bovine pollution were also observed to converge, in keeping with previous analyses, but still remained clearly segregated from uncontaminated environmental datasets (Figure 6a). Overall, this analysis suggested that the potential discriminatory power of the ɸB124-14 ecogenomic signal was preserved despite the additional wide variation in the innate background environmental signal, and that it could also distinguish different sources of pollution.
Figure 6. Simulation and modelling of virome based source tracking using ɸB124-14 ecogenomic signatures.
To evaluate the potential for the ɸB124-14 ecogenomic signature to be used in MST, we undertook more extensive Monte Carlo based simulations of pollution using randomly permuted and polluted environmental viromes, and specific detection of human pollution using ɸB124-14 ORF relative abundance profiles. (a) nMDS and ANOSIM analysis of uncontaminated and “polluted” permutations of environmental viral metagenomes. Symbol shape for polluted datasets (human, bovine, or porcine) represents the strength of contamination as indicated by the associated key. ANOSIM shows the separation of groups of datasets with varying ranges of human or animal contamination, from uncontaminated environmental viromes (** P = 0.001). ENVU – uncontaminated environmental virome permutations; ENVHGV – environmental virome permutations contaminated by human gut ecogenomic signature; ENVBOV – environmental virome permutations contaminated by bovine gut ecogenomic signature; ENVPORC – environmental virome permutations contaminated by porcine gut ecogenomic signature. (b) ROC curves were constructed from randomly permuted and polluted datasets displayed in part (a), based on relative abundance profiles from all ɸB124-14 ORFs, or subsets of ORFs exhibiting significantly greater mean relative abundance in human gut viromes than other datasets (see Figure 5c). Subset 1 ORFs = 5, 16, 18, 20, 21, 22, 23, 25, 34, 36, 43, 44, 59, 61, 67; Subset 2 ORFs = 16, 34, 56. The area under curve (AUC) for each ROC curve indicates the diagnostic potential for cumulative relative abundance of each ORF combination to distinguish different groups of datasets, with values approaching 0.5 indicating little or no diagnostic power. All AUC were statistically significant at P ≤ 0.002.
(c) Histograms show the proportion of datasets of each type (ENVU; ENVHGV; ENVBOV; ENVPORC) accurately identified by a 2-step classification approach using threshold values indicative of either pollution in general (Step 1) or human pollution more specifically (Step 2), selected based on sensitivity and specificity values generated by ROC analyses (a minimum sensitivity of 0.91). This pipeline was evaluated using threshold values for binning derived from either Subset 1 ORFs, Subset 2 ORFs, or a combination in which Subset 1 values were applied to Step 1, and Subset 2 values were applied to Step 2. **** P < 0.0001. Error bars show standard error of the mean from 100 iterations with 100 new randomly permuted and polluted datasets of each type per iteration.
To evaluate the possible discriminatory power of ɸB124-14 relative abundance profiles and specific human-gut affiliated ORF subsets in more detail, ROC curves were constructed based on relative abundance profiles from all ɸB124-14 ORFs, as well as subsets exhibiting significantly higher representation in human gut viromes compared to other viral datasets (Figure 6b). This revealed that the cumulative relative abundance profile derived from all ɸB124-14 ORFs had potentially high diagnostic value in terms of distinguishing uncontaminated datasets from polluted environmental viromes, but held no real diagnostic value for the distinction of human-polluted datasets from those subject to simulated bovine or porcine contamination (Figure 6b). A comparable performance was also predicted when ROC analysis was based on ORFs with significantly increased mean relative abundance in human gut viromes compared to environmental viromes (designated Subset 1; Figure 6b). In contrast, ROC analysis based only on those ORFs exhibiting significantly higher average representation in human gut viromes compared to all other viromes analysed (designated Subset 2; Figure 6b), showed considerably greater potential for distinguishing datasets subject to human-derived pollution from non-human sources, but a reduced capacity for distinguishing polluted from unpolluted datasets in general (Figure 6b). Collectively, these analyses indicated a 2-step process utilising different ɸB124-14 ORF subsets should provide the best performance in terms of sensitivity, specificity and overall accuracy.
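As a brief illustration of what the AUC of a ROC curve measures, the sketch below computes it directly as the probability that a randomly chosen polluted dataset has a higher cumulative relative abundance than a randomly chosen unpolluted one (the pairwise, Mann-Whitney formulation, with ties counted as one half). This is an illustrative calculation rather than the software used for the analyses, and all scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5, so a score with no diagnostic power gives about 0.5
    and perfect separation gives 1.0.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical cumulative relative abundance scores (hits/Mb).
polluted_scores = [0.9, 0.7, 0.4]
unpolluted_scores = [0.5, 0.2, 0.1]

auc = roc_auc(polluted_scores, unpolluted_scores)
print(round(auc, 3))
```

Here eight of the nine polluted/unpolluted pairs are ordered correctly, so the AUC is 8/9.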
To test these predictions, threshold cumulative relative abundance values (minimum sensitivity of 0.91 and the highest available specificity) were selected from ROC analyses and applied to the 2-step categorisation of randomly permuted and polluted datasets (Figure 6c). In this process, datasets were first categorised as polluted or non-polluted (Step 1), and polluted datasets subsequently scrutinised further to identify those contaminated specifically with human-derived signals (Step 2). This experiment confirmed that relative abundance profiles from Subset 1 ORFs were able to distinguish polluted from unpolluted datasets with high accuracy (high sensitivity, high specificity), but performed poorly in subsequent specific identification of human-polluted datasets (high sensitivity, low specificity) (Figure 6c). In contrast, the converse was observed for categorisation based solely on Subset 2 ORFs (Figure 6c). However, a good overall performance was obtained when Subset 1 and Subset 2 relative abundance profiles were used in combination. The application of Subset 1 ORF profiles in step 1, and Subset 2 ORF profiles in step 2, resulted in a highly accurate distinction of polluted from unpolluted datasets, as well as specific identification of those contaminated by human-derived signatures (Figure 6c).
Discussion
Here we provide evidence that a distinctive, human-gut associated ecogenomic signature can extend to specific phage from the human gut virome and distinct ecogenomic signatures can be found in phage from other habitats. Our analysis, encompassing both viral and whole community metagenomic datasets covering a wide range of environments, reveals the existence of a clear human gut-associated ecogenomic signature within the Bacteroides ɸB124-14 genome (Ogilvie et al., 2012). Analysis of the representation of sequences with similarity to this phage genome clearly groups metagenomic datasets based on their environmental origin, and identified regions of the ɸB124-14 genome with the strongest human gut-affiliation. Furthermore, through an in-silico modelling approach we provide preliminary proof-of-concept, and show these gut-associated genome regions likely hold sufficient discriminatory power for the development of phage-based metagenomic MST tools.
These findings are congruent with previous smaller-scale evaluations of the ɸB124-14 ecological profile using sequence alignments (Ogilvie et al., 2012), the tetranucleotide usage profile of the ɸB124-14 genome (Ogilvie et al., 2013), and evaluation of phage replication in gut-specific host bacteria (Ebdon et al., 2007). However, the present analysis differs notably not only in its increased scale, encompassing a considerably greater number and diversity of metagenomes than previous studies, but also in the premise from which the ɸB124-14 genome was analysed.
We hypothesised that any gut-associated ecogenomic signature encoded by ɸB124-14 would be derived from the co-evolution of this phage and its bacterial host within the human gut, and should manifest as an increased relative abundance of sequences with similarity to ɸB124-14 encoded genes in viromes from this habitat. However, by default this gene-centric hypothesis also allows that not all ɸB124-14 genes would be subject to the same selective forces, or be expected to display the same levels of ecological success in a given viral community or host microbiome. Therefore, rather than a single unified and fixed genetic unit, we instead viewed ɸB124-14 as an assemblage of independent but associated genes, each with its own evolutionary trajectory within a given microbial community, and calculated representation in metagenomic datasets on an individual gene-by-gene basis. Exploration of the ɸB124-14 genome in this way is also more compatible with the mosaic nature and inherent plasticity of phage genomes (Brüssow et al., 2004; Hendrix et al., 1999; Hatfull and Hendrix, 2011), and stands to provide more flexibility in the use of phage sequences in the development of MST tools.
Overall, this approach allowed us to identify genes or genome regions with the strongest affiliation to the human gut microbiome in ɸB124-14, and therefore the most suitable potential targets for development of molecular or metagenome-based MST assays. Although only a general association with the mammalian gut virome (human, porcine, bovine) was initially noted in surveys of cumulative relative abundance, likely reflecting common features of these mammalian gut microbiomes (such as an abundance of Bacteroides sp.; Looft et al., 2014; Jami et al., 2013), discrete regions with more specific human-gut affiliation were resolved through more detailed analysis of the ɸB124-14 genome. Importantly, our results also show this approach is equally capable of distinguishing alternative ecogenomic signatures in other phage, or indicating the absence of any habitat-affiliation should clear ecogenomic signals not be readily identifiable in a phage genome.
This was clearly demonstrated by conducting identical analyses of phage from other environments (ɸSYN5 and ɸKS10), which are considered to have no notable association with the human gut microbiome, and displayed no human-gut related ecogenomic signature. A distinct environmental ecogenomic signature was detected in ɸSYN5 using this approach, while no discernible ecogenomic signal was apparent in ɸKS10. While ɸSYN5 observations are in keeping with the habitat of its bacterial host, the lack of any detectable ecological affiliation in ɸKS10 likely reflects the paucity of available datasets covering terrestrial habitats relevant to this bacteriophage, and the overall “healthy” status of volunteers from which human metagenomes were derived. It is also possible that the temperate nature of ɸKS10 may contribute to the lack of a detectable ecogenomic profile, but the use of whole community metagenomes should compensate for this aspect of the ɸKS10 lifestyle. Collectively, analyses of both ɸSYN5 and ɸKS10 provide further support for the hypothesis that relative abundance profiles of genes similar to ɸB124-14 ORFs in metagenomic datasets are indeed reflective of a gut-related ecogenomic signal.
Congruent with the concept of ɸB124-14 as a collective of genes with independent evolutionary trajectories was the clear variability in gut affiliation of individual ORFs evident across the ɸB124-14 genome. Notably, no strong representation in any habitat was observed for some genes, while some aspects of the ɸB124-14 functional repertoire (the majority related to DNA regulation and replication) were indicated to be conserved across multiple disparate environments. Examples of similar highly cosmopolitan genes were also identified in ɸSYN5 and ɸKS10, and phage-encoded genes with broad environmental distribution have been reported in other studies (Breitbart et al., 2004; Breitbart and Rohwer, 2005; Roux et al., 2016; Manrique et al., 2016), suggesting these may be relatively common within phage genomes. These cosmopolitan genes were counterbalanced by genes that showed a seemingly more provincial, gut-specific representation. Taken together, these observations are compatible with the notion that the abundance of genes similar to particular ɸB124-14 ORFs in human gut datasets reflects environmental selection on a gene-by-gene basis (Brito et al., 2016), as well as the extant features of the human gut virome in terms of dominance of temperate phage and an intimate role for phage in community function and stability [reviewed in (Ogilvie and Jones, 2015)].
Using the ɸB124-14 relative abundance profile to “contaminate” viral datasets of environmental origin, also permitted crude in silico simulations of human faecal pollution, and modelling of how MST tools based on bacteriophage ecogenomic profiles and gut-affiliated phage gene subsets may conceivably operate. In these experiments we focused on viral metagenomes specifically due to the clear segregation of viromes in nMDS ordinations, and the proposed advantages of phage in MST applications (Gómez-Doñate et al., 2011; Lee et al., 2009). For initial evaluations (Figure 4) the choice of environmental viral datasets “polluted” was focused on those most likely to be already impacted by human activity and/or with a strong innate background environmental signal (e.g. temperate marine environments, coastal waters near major population centres and reclaimed water). The datasets selected therefore encompassed environmental viromes exhibiting the highest background ɸB124-14 cumulative relative abundance profiles, to provide a conservative and stringent evaluation of the potential for the ɸB124-14 gut-associated ecogenomic signal to distinguish polluted from uncontaminated environmental datasets. In addition, the degree to which the applied human-derived signal was diluted in these experiments was congruent with that observed for other indicators of pollution during events such as Combined Sewer Overflows (Madoux-Humery et al., 2013; Passerat et al., 2011).
This evaluation demonstrated that the separation of polluted environmental datasets towards human gut viromes was in proportion to the strength of the introduced human-gut related ɸB124-14 signal. Expansion of this in silico modelling approach using a wider range of randomly permuted and polluted environmental profiles, and more focused ɸB124-14 ORF subsets indicated to have the greatest diagnostic power in ROC analyses, further demonstrated that the relative abundance of ɸB124-14 ORFs within different viromes can potentially distinguish those specifically contaminated with human-derived ecogenomic profiles with high accuracy. The levels of sensitivity and specificity achieved during these simulations were comparable to those reported for a wide range of qPCR based methods using multiple or combined bacterial or viral gene targets (reviewed in Harwood et al., 2014).
Although the in silico modelling undertaken here affords only a very basic and simplistic simulation of pollution and the use of phage ecogenomic signatures for MST, these experiments nonetheless provide an initial proof-of-concept that viral metagenomic datasets can be distinguished in this way, and support the possibility for development of new MST methods based on these concepts. Moreover, it should be noted that modelling undertaken here was based on only a single phage ecogenomic profile, and used only basic abundance thresholds to discriminate datasets. The metagenomic approach opens the potential to simultaneously utilise a large number of indicators derived from many phage, and move beyond simple abundance based thresholds. The inclusion of further phage ecogenomic signatures, coupled with the development of more powerful diagnostic algorithms, should further enhance performance of these approaches. Our use of different subsets of ɸB124-14 ORFs in distinct stages of dataset categorization during simulations also serves to highlight some of the advantages of metagenomic approaches to MST.
Furthermore, unlike qPCR and other direct molecular biology assays, metagenomics can capture information on an almost unlimited array of genes present in a sample, while emphasis is placed on the analysis of sequence data to provide the actual diagnostic test. Because of this, once an initial metagenomic strategy for sampling and generation of sequence data has been developed, the cost, time, and labour involved in continual adaptation and improvement of assays is considerably reduced. Modelling of new strategies is also readily implemented, and performance of multiple distinct algorithms or new “tests” may be compared directly in parallel on the same samples and datasets, without compromising results of ongoing source tracking activities. This should provide considerable flexibility in the design, implementation, and continued improvement of metagenome-based MST tools, and as new information and targets are identified these may be easily evaluated on historical data with established provenance, and incorporated into the MST pipeline without altering the basic sampling and sequencing protocols. It should also be noted that the generation of sequence data from samples is also no longer a major barrier to implementing such approaches. Fully portable and affordable sequencing platforms such as the MinION from Oxford Nanopore Technologies are commercially available, and have been used in the field for metagenomics analysis in habitats ranging from the Arctic Tundra to the International Space Station.
Nevertheless, care must be taken not to over-interpret the results presented here, which should be considered in the context of the limitations and potential biases within existing metagenomic datasets, the relatively simplistic and crude modelling undertaken, as well as the relatively poor representation of most habitats afforded by the metagenomic datasets available. Metagenomes analysed here were drawn from a variety of sources, and vary in terms of construction methods, community coverage, assembly status, sample sizes, and sample numbers. Because of this, the simple relative abundance approach used here intentionally employs more permissive criteria for identifying sequences with similarity to target sequences, to reduce the impact of these methodological variations and provide a conservative and robust comparison between datasets. This strategy seeks to identify general patterns in relative representation of broad functions between datasets rather than identical genes or sequences, with normalisation for differing depths of sequencing between datasets, and has previously been shown to enable useful comparison of metagenomes generated by different approaches (Jones et al., 2008, 2010; Ogilvie et al., 2012; Ogilvie and Jones, 2012; Ogilvie et al., 2013). Furthermore, the use of more permissive criteria in the relative abundance analyses was also intended to provide a more robust and conservative test of the phage ecogenomic signature hypothesis. In essence, these criteria should maximise the detection of conflicting non-specific signals in non-target datasets, meaning that distinct phage ecogenomic profiles need to be discernible against a higher level of background “noise” to be identified in this analysis.
The utility of this approach was also supported in the present study, in which available datasets were shown to form cohesive and well-defined groups based on habitat in nMDS ordinations. Notable examples include conventional human gut metagenomes produced using distinct metagenomic techniques and sequencing methods (Nelson et al., 2010; Qin et al., 2010; Reyes et al., 2010; Gill et al., 2006; Kurokawa et al., 2007), which were clearly localised to a cohesive group. Comparison of assembled and unassembled versions of the same human gut viral datasets in these experiments, also confirmed that assembly should have only minimal impact on the overall results obtained, and did not obscure the habitat-derived ecological signatures present in these metagenomes, or the distinction between viral and whole community datasets. Overall, available evidence suggests that the approaches we have used to compare datasets permit identification of genuine differences based on relative gene abundance and provide meaningful insight into habitat-associated features of these metagenomes.
Of more concern are the relatively small numbers of samples and datasets available for all habitats, most notably viromes and non-human gut whole community datasets. This is exacerbated by the high inter-individual variability noted in human viral metagenomes used here and in other studies (Minot et al., 2011; Ogilvie et al., 2013), but in practice for human gut viromes this variation is likely to be offset to some degree by the fact that MST will be based on aggregate gut microbiome outputs from human populations as a whole, rather than individual microbiomes. However, a distinct geographic variation is also believed to exist in the human gut microbiome (Yadav et al., 2016; Suzuki and Worobey, 2014; Ogilvie et al., 2012), and culture-based approaches utilising gut-associated phage infecting Bacteroides species have already highlighted the possible need to develop region-specific MST tools (Payan et al., 2005; Harwood et al., 2014). Although here and in other studies, whole community human gut datasets derived from individuals from disparate geographic locations (Gill et al., 2006; Qin et al., 2010; Kurokawa et al., 2007) were found to still group clearly based on habitat in higher level analyses, the human gut viromes we analysed are derived exclusively from individuals residing in the USA, and so provide little insight into possible geographical effects. Moreover, the geographic variation in gut virome composition has yet to be subject to the same level of scrutiny directed towards the bacterial component of this ecosystem. In addition, the number of viral particles, derived levels of nucleic acids and details of sampling and processing methods that may provide a useful lower limit from which diagnostic relative abundance profiles can be calculated, remain to be determined. 
Further large-scale studies will be required to address these questions, fully test the hypotheses presented here, and fully examine the potential for phage-based metagenomic MST tools derived from these ecogenomic concepts. This will not only entail the generation and use of a more extensive collection of viral metagenomes from relevant sources but also the isolation and characterisation of further phage genomes from these habitats, including identification of those with ecogenomic signatures that may be utilised and incorporated into phage-based MST approaches.
In essence, the gene pool of a given microbial community adapts over time reflecting the challenges of life in a given habitat, as well as the ancestry of community members (Ley et al., 2006). Here we provide evidence that this may also manifest as a bias within the viral gene pool of particular microbiomes, forming the basis for a habitat-related ecogenomic signature, which can also be detected in individual member phage. In summary, the work presented here provides new fundamental insights into phage ecology that could support the development of a novel range of highly specific, sensitive, rapid, and portable phage-based metagenomic MST tools.
Methods
Cumulative relative abundance of genes with similarity to phage encoded Open Reading Frames
The representation of sequences with similarity to phage-encoded functions and calculation of cumulative relative gene representation between datasets was performed as previously described (Ogilvie et al., 2012; Jones, 2010; Jones et al., 2008), but with the following modifications: Unassembled viral datasets were surveyed by mapping raw sequencing reads to translated ɸB124-14, ɸSYN5 or ɸKS10 ORFs using BlastX. Assembled whole community metagenomes and assembled viral datasets were searched using tBlastn with amino acid sequences from each predicted phage ORF. For both dataset types, valid hits were considered to be those generating ≥35% identity over ≥50% of the query sequence and an e-value of ≤ 1e-5. Valid hits were used to calculate the relative abundance of each phage-encoded ORF in each dataset (expressed as Hits/Mb of sequence data). The cumulative relative abundance of ORFs encoded by each phage was taken as the sum of all individual ORF relative abundances. Blast searches and calculation of relative abundance were automated using a custom Perl script (access and support is freely available on request to authors), which implemented BLAST v2.2.29 with default settings, searched custom Blast databases generated from each metagenomic dataset, processed BLAST outputs to identify valid hits based on criteria above, and calculated relative abundance for each phage ORF in each metagenomic dataset. Data were saved as *.csv format files and imported into Microsoft Excel for further analysis. Significant differences in cumulative relative abundances between metagenomes were assessed using the Kruskal-Wallis test with Dunn’s correction for multiple comparisons. Statistical analyses and generation of scatterplots were performed in GraphPad Prism 6.0 for Mac OS X.
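The hit-filtering and normalisation steps described above can be illustrated with a minimal Python sketch. This is not the authors' Perl script: the `Hit` record, the function names, and the way query coverage is computed (alignment length divided by query length) are assumptions made purely for illustration, while the thresholds (≥35% identity, ≥50% query coverage, e-value ≤ 1e-5) and the Hits/Mb unit are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST alignment (hypothetical record, not a real BLAST output format)."""
    orf: str             # phage ORF involved in the alignment
    pct_identity: float  # percent identity of the alignment
    align_len: int       # aligned length, in the same units as query_len
    query_len: int       # length of the query sequence
    evalue: float


def is_valid_hit(hit, min_identity=35.0, min_coverage=0.5, max_evalue=1e-5):
    """Apply the validity criteria stated in the Methods."""
    coverage = hit.align_len / hit.query_len  # assumed definition of coverage
    return (hit.pct_identity >= min_identity
            and coverage >= min_coverage
            and hit.evalue <= max_evalue)


def relative_abundance(hits, orfs, dataset_size_mb):
    """Valid hits per ORF, normalised to Hits/Mb of sequence data."""
    counts = {orf: 0 for orf in orfs}
    for hit in hits:
        if hit.orf in counts and is_valid_hit(hit):
            counts[hit.orf] += 1
    return {orf: n / dataset_size_mb for orf, n in counts.items()}


def cumulative_relative_abundance(profile):
    """Sum of all individual ORF relative abundances for one phage."""
    return sum(profile.values())
```

Normalising by dataset size in Mb is what allows datasets of different sequencing depth to be compared on the same scale.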
Unsupervised ordination of metagenomic datasets based on phage-related ecogenomic profiles
Ordination of metagenomes was performed using the Vegan package (v2.4) (Oksanen et al., 2016) in R to conduct non-metric multidimensional scaling (nMDS) (Clarke, 1993) and Analysis of Similarities (ANOSIM) (Clarke, 1993), using the metaMDS and anosim functions respectively. For nMDS and ANOSIM, individual gene relative abundance profiles for each phage in each metagenomic dataset (calculated as described above) were used and only datasets exhibiting sequences with similarity to at least 2 distinct ORFs per phage (i.e. a minimum of 2 valid hits to distinct ORFs in BLAST searches) were included. Relative abundance data were square root transformed, before being used to construct Bray-Curtis distance matrices (Vegan package in R), and then for nMDS (with a minimum of 1000 random starts). Square root transformed data were used directly without further processing for ANOSIM analyses, which calculated the level and significance of separation between defined groups of metagenomes based on habitat of origin. The ANOSIM R statistic indicates increasing separation of groups as values approach 1, while statistical significance is provided by an associated P value. Graphical representations of nMDS ordinations were produced using Vegan ordiplot functions in R. ANOSIM data were visualised using GraphPad Prism 6.0 for OS X.
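As an illustration of the pre-processing that feeds the ordination, the following Python sketch applies the square root transformation and computes Bray-Curtis dissimilarities between ORF relative abundance profiles. It does not reproduce the Vegan functions used in the study; the function names are hypothetical, and returning 0.0 for two all-zero profiles is a choice made here for convenience (in the study, datasets with fewer than 2 ORFs hit were excluded, so this case should not arise).

```python
import math


def sqrt_transform(profile):
    """Square root transform a list of ORF relative abundances."""
    return [math.sqrt(v) for v in profile]


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    # Assumption for this sketch: two empty profiles are treated as identical.
    return numerator / denominator if denominator > 0 else 0.0


def distance_matrix(profiles):
    """Pairwise Bray-Curtis matrix on square root transformed profiles."""
    transformed = [sqrt_transform(p) for p in profiles]
    n = len(transformed)
    return [[bray_curtis(transformed[i], transformed[j]) for j in range(n)]
            for i in range(n)]
```

A value of 0 indicates identical profiles and a value of 1 indicates profiles that share no ORF hits, which is the distance scale on which nMDS then arranges the datasets.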
In silico simulation of human faecal pollution in environmental datasets
Contamination of environmental datasets with human pollution was simulated by addition of the ɸB124-14 human gut virome ecogenomic signature to selected environmental viromes. The average relative abundance of each ɸB124-14 ORF within human gut viromes (Reyes et al., 2010) (n=12) was added to the corresponding ɸB124-14 ORF relative abundance in selected environmental viromes on a gene-by-gene basis, at “strengths” ranging from 100% to 0.01%. The viromes subjected to this simulated human faecal pollution, were selected based on those most likely to be already impacted by human activity, and/or contain a strong innate background environmental signal distinct from that of the gut microbiome [Bay of British Columbia, Sargasso Sea, Gulf of Mexico (Angly et al., 2006), Tampa Bay (McDaniel et al., 2008), and Reclaimed Water (Rosario et al., 2009)]. The ability of ɸB124-14 human gut ecogenomic signals to discriminate polluted environmental datasets from original uncontaminated datasets, was evaluated using nMDS ordination and ANOSIM, as described above.
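The gene-by-gene addition described above can be sketched in Python as follows. The function names are hypothetical and profiles are assumed to be lists ordered identically by ORF; only the overall operation (adding a scaled average human gut virome profile to an environmental profile) comes from the text.

```python
def mean_profile(profiles):
    """Average relative abundance of each ORF across a set of viromes."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def pollute(env_profile, source_mean, strength_pct):
    """Add a source signal to an environmental profile, ORF by ORF.

    strength_pct is the signal "strength" as a percentage; the text gives a
    range of 100% down to 0.01%.
    """
    fraction = strength_pct / 100.0
    return [e + fraction * s for e, s in zip(env_profile, source_mean)]
```

For example, `pollute(env, mean_profile(human_gut_viromes), 0.01)` would represent the most dilute contamination level mentioned above, where `env` and `human_gut_viromes` are placeholder inputs.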
Identification of regions of the ɸB124-14 genome with the strongest ecogenomic signal
The variation in the “strength” of the human gut-associated ɸB124-14 ecogenomic signal across the phage genome, and its representation in datasets from distinct environmental groups, was assessed by transforming all relative abundance values by addition of a small positive value (y + 0.00001), before conversion to Log10 hits/Mb DNA. The relative abundance of ɸB124-14 ORFs within human gut viromes was compared to profiles observed in bovine and porcine gut viromes, environmental viromes, as well as whole community human gut and environmental metagenomes. Significant differences between the relative representation of ɸB124-14 ORFs in human gut viromes compared to other datasets were determined using the Kruskal-Wallis test with Dunn’s correction for multiple comparisons, in GraphPad Prism 6 for OS X.
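The transformation described above exists because ORFs with no valid hits have a relative abundance of zero, for which a logarithm is undefined. A minimal Python sketch, with a hypothetical function name:

```python
import math

PSEUDOCOUNT = 0.00001  # small positive value stated in the text


def log10_transform(values, pseudocount=PSEUDOCOUNT):
    """Convert relative abundances (hits/Mb) to log10(y + pseudocount)."""
    return [math.log10(v + pseudocount) for v in values]
```

With this offset, an ORF with no hits maps to approximately -5, which sets the floor of the transformed scale.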
Simulation and modelling of virome-based source tracking using ɸB124-14 ecogenomic signatures
The use of ɸB124-14 relative abundance profiles for microbial source tracking was evaluated using a Monte Carlo based simulation with uniform probability distribution input, derived from the maximum baseline relative abundance values for each ɸB124-14 ORF across all environmental viral metagenomes. In these simulations, permutations of environmental ɸB124-14 relative abundance profiles were generated through random variation of each ORF relative abundance value, ranging from 0 to the maximum value observed for a given ORF across all environmental viromes. Copies of randomly permuted environmental viromes were subsequently subjected to simulated in silico pollution through addition of average human, bovine, or porcine ɸB124-14 relative abundance profiles, at randomly selected signal strengths ranging from 0-100%. In each iteration 100 randomly permuted environmental viromes were created and used to generate 100 randomly polluted datasets of each type (human, bovine, porcine). Data from a single iteration was used to visualise relationships between datasets using nMDS and ANOSIM as described for unsupervised ordination of metagenomic datasets above, and also to construct ROC curves based on cumulative relative abundance profiles for either all ORFs, or subsets found to be significantly increased in relative abundance compared to other datasets (See Figure 5c). Data from all iterations were used to evaluate the performance of cumulative relative abundance thresholds in accurately identifying human polluted datasets in a 2-step binning process, based on threshold values derived from ROC analyses. Step 1 was used to categorize datasets as either polluted or non-polluted. In Step 2 datasets categorised as polluted in Step 1 were sorted further into ‘Human-polluted’ and ‘Non-human polluted’ categories, using a second threshold value from ROC analyses. 
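The simulation and 2-step binning described above can be sketched in Python as follows. This is an illustrative reconstruction rather than the authors' implementation: the function names and category labels are hypothetical, profiles are assumed to be lists ordered by ORF, and treating a cumulative value at or above a threshold as a positive call is an assumption.

```python
import random


def permute_environment(max_per_orf, rng):
    """Draw each ORF value uniformly between 0 and its observed maximum."""
    return [rng.uniform(0.0, m) for m in max_per_orf]


def pollute_random(env_profile, source_mean, rng):
    """Add a source profile at a random signal strength between 0 and 100%."""
    strength = rng.uniform(0.0, 1.0)
    return [e + strength * s for e, s in zip(env_profile, source_mean)]


def cumulative(profile, orf_indices):
    """Cumulative relative abundance over a chosen subset of ORFs."""
    return sum(profile[i] for i in orf_indices)


def two_step_bin(profile, step1_orfs, step1_threshold,
                 step2_orfs, step2_threshold):
    """Step 1: polluted or not. Step 2: human or non-human pollution."""
    if cumulative(profile, step1_orfs) < step1_threshold:
        return "non-polluted"
    if cumulative(profile, step2_orfs) >= step2_threshold:
        return "human-polluted"
    return "non-human polluted"
```

One iteration would then call `permute_environment` 100 times with a seeded `random.Random` instance and apply `pollute_random` to a copy of each result with the human, bovine, and porcine mean profiles in turn.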
Threshold values were selected to achieve the best possible sensitivity and specificity, but with a minimum sensitivity of 0.91. ROC analysis and statistical comparisons of performance of ORF combinations in categorizing datasets (ANOVA with Bonferroni correction) were conducted using GraphPad Prism for OS X.
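One way to express this selection rule is sketched below in Python. The study derived thresholds from ROC analyses in GraphPad Prism; the function names, the use of observed scores as candidate thresholds, and the tie-breaking rule (prefer higher sensitivity when specificity is equal) are assumptions made for illustration.

```python
def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Scores at or above the threshold are called positive (assumption)."""
    true_pos = sum(1 for s in pos_scores if s >= threshold)
    true_neg = sum(1 for s in neg_scores if s < threshold)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


def select_threshold(pos_scores, neg_scores, min_sensitivity=0.91):
    """Return (threshold, specificity, sensitivity) with the highest
    specificity among thresholds meeting the minimum sensitivity."""
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        sens, spec = sensitivity_specificity(pos_scores, neg_scores, t)
        if sens < min_sensitivity:
            continue
        if best is None or (spec, sens) > (best[1], best[2]):
            best = (t, spec, sens)
    return best
```

Here `pos_scores` and `neg_scores` would be the cumulative relative abundance values of, for example, polluted and unpolluted simulated datasets for Step 1, or human-polluted and non-human polluted datasets for Step 2.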
Supplementary Material
Supplementary information is available at The ISME Journal’s website.
Acknowledgements
This work was supported by funding from the University of Brighton (Research Challenges Scheme awarded to Dr. Lesley A Ogilvie). Dr. Brian V Jones and Dr. Jonathan Nzakizwanayo were also supported by funding from the Medical Research Council (G0901553; MR/N006496/1), and the Society for Applied Microbiology. The authors would also like to thank Mr Simon Booth and Dr Caroline Jones for constructive criticism and expert advice.
Footnotes
Competing Financial Interests
The author(s) declare no competing financial interests.
References
- Ahmed W, Hughes B, Harwood V. Current Status of Marker Genes of Bacteroides and Related Taxa for Identifying Sewage Pollution in Environmental Waters. Water. 2016;8:231.
- Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, et al. The marine viromes of four oceanic regions. PLoS Biol. 2006;4:e368. doi: 10.1371/journal.pbio.0040368.
- Breitbart M, Miyake JH, Rohwer F. Global distribution of nearly identical phage-encoded DNA sequences. FEMS Microbiol Lett. 2004;236:249–56. doi: 10.1016/j.femsle.2004.05.042.
- Breitbart M, Rohwer F. Here a virus, there a virus, everywhere the same virus? Trends Microbiol. 2005;13:278–84. doi: 10.1016/j.tim.2005.04.003.
- Brito IL, Yilmaz S, Huang K, Xu L, Jupiter SD, Jenkins AP, et al. Mobile genes in the human microbiome are structured from global to individual scales. Nature. 2016;535:435–9. doi: 10.1038/nature18927.
- Brüssow H, Canchaya C, Hardt W-D. Phages and the evolution of bacterial pathogens: from genomic rearrangements to lysogenic conversion. Microbiol Mol Biol Rev. 2004;68:560–602. doi: 10.1128/MMBR.68.3.560-602.2004.
- Clarke K. Non-parametric multivariate analyses of changes in community structure. Aust J Ecol. 1993;18:117–143.
- Ebdon J, Muniesa M, Taylor H. The application of a recently isolated strain of Bacteroides (GB-124) to identify human sources of faecal pollution in a temperate river catchment. Water Res. 2007;41:3683–90. doi: 10.1016/j.watres.2006.12.020.
- Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–9. doi: 10.1126/science.1124234.
- Gómez-Doñate M, Casanovas-Massana A, Muniesa M, Blanch AR. Development of new host-specific Bacteroides qPCRs for the identification of fecal contamination sources in water. Microbiologyopen. 2016;5:83–94. doi: 10.1002/mbo3.313.
- Gómez-Doñate M, Payán A, Cortés I, Blanch AR, Lucena F, Jofre J, et al. Isolation of bacteriophage host strains of Bacteroides species suitable for tracking sources of animal faecal pollution in water. Environ Microbiol. 2011;13:1622–31. doi: 10.1111/j.1462-2920.2011.02474.x.
- Goudie AD, Lynch KH, Seed KD, Stothard P, Shrivastava S, Wishart DS, et al. Genomic sequence and activity of KS10, a transposable phage of the Burkholderia cepacia complex. BMC Genomics. 2008;9:615. doi: 10.1186/1471-2164-9-615.
- Griffith JF, Cao Y, McGee CD, Weisberg SB. Evaluation of rapid methods and novel indicators for assessing microbiological beach water quality. Water Res. 2009;43:4900–7. doi: 10.1016/j.watres.2009.09.017.
- Haack SK, Fogarty LR, Wright C. Escherichia coli and enterococci at beaches in the Grand Traverse Bay, Lake Michigan: sources, characteristics, and environmental pathways. Environ Sci Technol. 2003;37:3275–82. doi: 10.1021/es021062n.
- Harwood VJ, Boehm AB, Sassoubre LM, Vijayavel K, Stewart JR, Fong T-T, et al. Performance of viruses and bacteriophages for fecal source determination in a multi-laboratory, comparative study. Water Res. 2013;47:6929–43. doi: 10.1016/j.watres.2013.04.064.
- Harwood VJ, Staley C, Badgley BD, Borges K, Korajkic A. Microbial source tracking markers for detection of fecal contamination in environmental waters: relationships between pathogens and human health outcomes. FEMS Microbiol Rev. 2014;38:1–40. doi: 10.1111/1574-6976.12031.
- Hatfull GF, Hendrix RW. Bacteriophages and their genomes. Curr Opin Virol. 2011;1:298–303. doi: 10.1016/j.coviro.2011.06.009.
- Hendrix RW, Smith MC, Burns RN, Ford ME, Hatfull GF. Evolutionary relationships among diverse bacteriophages and prophages: all the world’s a phage. Proc Natl Acad Sci U S A. 1999;96:2192–7. doi: 10.1073/pnas.96.5.2192.
- Ishii S, Yan T, Shively DA, Byappanahalli MN, Whitman RL, Sadowsky MJ. Cladophora (Chlorophyta) spp. harbor human bacterial pathogens in nearshore water of Lake Michigan. Appl Environ Microbiol. 2006;72:4545–53. doi: 10.1128/AEM.00131-06.
- Jami E, Israel A, Kotser A, Mizrahi I. Exploring the bovine rumen bacterial community from birth to adulthood. ISME J. 2013;7:1069–79. doi: 10.1038/ismej.2013.2.
- Jofre J, Blanch AR, Lucena F, Muniesa M. Bacteriophages infecting Bacteroides as a marker for microbial source tracking. Water Res. 2014;55:1–11. doi: 10.1016/j.watres.2014.02.006.
- Jones BV. The human gut mobile metagenome: a metazoan perspective. Gut Microbes. 2010;1:415–31. doi: 10.4161/gmic.1.6.14087.
- Jones BV, Begley M, Hill C, Gahan CGM, Marchesi JR. Functional and comparative metagenomic analysis of bile salt hydrolase activity in the human gut microbiome. Proc Natl Acad Sci U S A. 2008;105:13580–5. doi: 10.1073/pnas.0804437105.
- Jones BV, Sun F, Marchesi JR. Comparative metagenomic analysis of plasmid encoded functions in the human gut microbiome. BMC Genomics. 2010;11:46. doi: 10.1186/1471-2164-11-46.
- Knights D, Kuczynski J, Charlson ES, Zaneveld J, Mozer MC, Collman RG, et al. Bayesian community-wide culture-independent microbial source tracking. Nat Methods. 2011;8:761–3. doi: 10.1038/nmeth.1650.
- Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14:169–81. doi: 10.1093/dnares/dsm018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leclerc H, Mossel DA, Edberg SC, Struijk CB. Advances in the bacteriology of the coliform group: their suitability as markers of microbial water safety. Annu Rev Microbiol. 2001;55:201–34. doi: 10.1146/annurev.micro.55.1.201. [DOI] [PubMed] [Google Scholar]
- Lee JE, Lim MY, Kim SY, Lee S, Lee H, Oh H-M, et al. Molecular characterization of bacteriophages for microbial source tracking in Korea. Appl Environ Microbiol. 2009;75:7107–14. doi: 10.1128/AEM.00464-09. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ley RE, Peterson DA, Gordon JI. Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell. 2006;124:837–48. doi: 10.1016/j.cell.2006.02.017.
- Looft T, Allen HK, Cantarel BL, Levine UY, Bayles DO, Alt DP, et al. Bacteria, phages and pigs: the effects of in-feed antibiotics on the microbiome at different gut locations. ISME J. 2014;8:1566–1576. doi: 10.1038/ismej.2014.12.
- Madoux-Humery A-S, Dorner S, Sauvé S, Aboulfadl K, Galarneau M, Servais P, et al. Temporal variability of combined sewer overflow contaminants: evaluation of wastewater micropollutants as tracers of fecal contamination. Water Res. 2013;47:4370–82. doi: 10.1016/j.watres.2013.04.030.
- Manrique P, Bolduc B, Walk ST, van der Oost J, de Vos WM, Young MJ. Healthy human gut phageome. Proc Natl Acad Sci U S A. 2016;113:10400–5. doi: 10.1073/pnas.1601060113.
- McDaniel L, Breitbart M, Mobberley J, Long A, Haynes M, Rohwer F, et al. Metagenomic analysis of lysogeny in Tampa Bay: implications for prophage gene expression. PLoS One. 2008;3:e3263. doi: 10.1371/journal.pone.0003263.
- McLellan SL, Salmore AK. Evidence for localized bacterial loading as the cause of chronic beach closings in a freshwater marina. Water Res. 2003;37:2700–8. doi: 10.1016/S0043-1354(03)00068-X.
- McMinn BR, Korajkic A, Ashbolt NJ. Evaluation of Bacteroides fragilis GB-124 bacteriophages as novel human-associated faecal indicators in the United States. Lett Appl Microbiol. 2014;59:115–21. doi: 10.1111/lam.12252.
- Minot S, Sinha R, Chen J, Li H, Keilbaugh SA, Wu GD, et al. The human gut virome: Inter-individual variation and dynamic response to diet. Genome Res. 2011;21:1616–25. doi: 10.1101/gr.122705.111.
- Nelson KE, Weinstock GM, Highlander SK, Worley KC, Creasy HH, Wortman JR, et al. A Catalog of Reference Genomes from the Human Microbiome. Science. 2010;328:994–999. doi: 10.1126/science.1183605.
- Newton RJ, Bootsma MJ, Morrison HG, Sogin ML, McLellan SL. A microbial signature approach to identify fecal pollution in the waters off an urbanized coast of Lake Michigan. Microb Ecol. 2013;65:1011–23. doi: 10.1007/s00248-013-0200-9.
- Ogilvie LA, Bowler LD, Caplin J, Dedi C, Diston D, Cheek E, et al. Genome signature-based dissection of human gut metagenomes to extract subliminal viral sequences. Nat Commun. 2013;4:2420. doi: 10.1038/ncomms3420.
- Ogilvie LA, Caplin J, Dedi C, Diston D, Cheek E, Bowler L, et al. Comparative (meta)genomic analysis and ecological profiling of human gut-specific bacteriophage ɸB124-14. PLoS One. 2012;7:e35053. doi: 10.1371/journal.pone.0035053.
- Ogilvie LA, Jones BV. Dysbiosis modulates capacity for bile acid modification in the gut microbiomes of patients with inflammatory bowel disease: a mechanism and marker of disease? Gut. 2012;61:1642–1643. doi: 10.1136/gutjnl-2012-302137.
- Ogilvie LA, Jones BV. The human gut virome: a multifaceted majority. Front Microbiol. 2015;6:918. doi: 10.3389/fmicb.2015.00918.
- Oksanen J, Guillaume Blanchet F, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MH, et al. vegan: Community Ecology Package. R package version 2.4-0. 2016. https://cran.r-project.org/package=vegan.
- Passerat J, Ouattara NK, Mouchel J-M, Rocher V, Servais P. Impact of an intense combined sewer overflow event on the microbiological water quality of the Seine River. Water Res. 2011;45:893–903. doi: 10.1016/j.watres.2010.09.024.
- Payan A, Ebdon J, Taylor H, Gantzer C, Ottoson J, Papageorgiou GT, et al. Method for Isolation of Bacteroides Bacteriophage Host Strains Suitable for Tracking Sources of Fecal Pollution in Water. Appl Environ Microbiol. 2005;71:5659–5662. doi: 10.1128/AEM.71.9.5659-5662.2005.
- Pedulla ML, Ford ME, Houtz JM, Karthikeyan T, Wadsworth C, Lewis JA, et al. Origins of highly mosaic mycobacteriophage genomes. Cell. 2003;113:171–82. doi: 10.1016/s0092-8674(03)00233-2.
- Pope WH, Weigele PR, Chang J, Pedulla ML, Ford ME, Houtz JM, et al. Genome Sequence, Structural Proteins, and Capsid Organization of the Cyanophage Syn5: A ‘Horned’ Bacteriophage of Marine Synechococcus. J Mol Biol. 2007;368:966–981. doi: 10.1016/j.jmb.2007.02.046.
- Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65. doi: 10.1038/nature08821.
- Reyes A, Haynes M, Hanson N, Angly FE, Heath AC, Rohwer F, et al. Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature. 2010;466:334–338. doi: 10.1038/nature09199.
- Rosario K, Nilsson C, Lim YW, Ruan Y, Breitbart M. Metagenomic analysis of viruses in reclaimed water. Environ Microbiol. 2009;11:2806–2820. doi: 10.1111/j.1462-2920.2009.01964.x.
- Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature. 2016;537:689–693. doi: 10.1038/nature19366.
- Seed KD, Dennis JJ. Isolation and characterization of bacteriophages of the Burkholderia cepacia complex. FEMS Microbiol Lett. 2005;251:273–280. doi: 10.1016/j.femsle.2005.08.011.
- Shanks OC, Newton RJ, Kelty CA, Huse SM, Sogin ML, McLellan SL. Comparison of the microbial community structures of untreated wastewaters from different geographic locales. Appl Environ Microbiol. 2013;79:2906–13. doi: 10.1128/AEM.03448-12.
- Suzuki TA, Worobey M. Geographical variation of human gut microbial composition. Biol Lett. 2014;10:20131037. doi: 10.1098/rsbl.2013.1037.
- Tan B, Ng C, Nshimyimana JP, Loh LL, Gin KY-H, Thompson JR. Next-generation sequencing (NGS) for assessment of microbial water quality: current progress, challenges, and future opportunities. Front Microbiol. 2015;6:1027. doi: 10.3389/fmicb.2015.01027.
- Unno T, Di DYW, Jang J, Suh YS, Sadowsky MJ, Hur H-G. Integrated online system for a pyrosequencing-based microbial source tracking method that targets Bacteroidetes 16S rDNA. Environ Sci Technol. 2012;46:93–8. doi: 10.1021/es201380c.
- Unno T, Jang J, Han D, Kim JH, Sadowsky MJ, Kim O-S, et al. Use of barcoded pyrosequencing and shared OTUs to determine sources of fecal bacteria in watersheds. Environ Sci Technol. 2010;44:7777–82. doi: 10.1021/es101500z.
- Whitman RL, Shively DA, Pawlik H, Nevers MB, Byappanahalli MN. Occurrence of Escherichia coli and enterococci in Cladophora (Chlorophyta) in nearshore water and beach sand of Lake Michigan. Appl Environ Microbiol. 2003;69:4714–9. doi: 10.1128/AEM.69.8.4714-4719.2003.
- Yadav D, Ghosh TS, Mande SS. Global investigation of composition and interaction networks in gut microbiomes of individuals belonging to diverse geographies and age-groups. Gut Pathog. 2016;8:17. doi: 10.1186/s13099-016-0099-z.