Abstract
An important fraction of microbial diversity is harbored in strain individuality, so identification of conspecific bacterial strains is imperative for improved understanding of microbial community functions. Limitations in bioinformatics and sequencing technologies have to date precluded strain identification owing to difficulties in phasing short reads to faithfully recover the original strain-level genotypes, which have highly similar sequences. We present ConStrains, an open-source algorithm that identifies conspecific strains from metagenomic sequence data and reconstructs the phylogeny of these strains in microbial communities. The algorithm uses single-nucleotide polymorphism (SNP) patterns in a set of universal genes to infer within-species structures that represent strains. Applying ConStrains to simulated and host-derived data sets provides insights into microbial community dynamics.
Understanding how individual organisms co-exist within a microbial community is crucial to understanding community functions. For example, the study of microbial community dynamics is important in human health, including how to maintain or restore a healthy human microbiome. Metagenomics has revolutionized microbiology by addressing some of these issues in a culture-independent manner. However, state-of-the-art metagenomics approaches are often limited to the species level1–3 or to partially assembled population consensus genomes4–6. Evidence that the unit of microbial action can fall below the species level comes from multiple sources, including culturing7, single-cell genomics8, redundant bacterial 16S rRNA gene sequencing9, internal transcribed spacer sequencing10, multilocus sequence typing11, and high-resolution genomic variation12. Therefore methods that enable strain resolution from metagenomics datasets are desirable.
Most existing culture-free approaches to identify bacterial strains in communities have drawbacks that have limited wide adoption. For example, single-cell sequencing requires expensive and laborious cell sorting and suspension, which makes analyzing a large community with this approach impractical. Similarly, Hi-C, a sequencing-based approach13, requires extra steps and budget for cross-linking, library construction, and sequencing. Strain typing methods leveraging strain-level gene copy number variations14 or strain-level phylogenetic marker SNPs such as canSNPs15, PathoScope16, and Sigma17 rely on the availability of complete reference strain genomes and, with current limitations on these resources, run into challenges when studying the broader diversity found using metagenomic sequencing approaches. An assembly-based approach is dependent on several factors, including genome structure and intra-species divergence. With rare exceptions, assemblers usually fail to produce individual strain assemblies, instead creating either highly fragmented contigs or contigs that only represent population consensus sequences18,19; a recent effort in using variation-aware contig graphs for strain identification20 relies on manual inspection, and hence its accuracy is subject to users’ experience. In all of these approaches, only a relatively small fraction of strain genomes have been successfully analyzed, and their distribution is usually biased21. On the other hand, methods based on single marker genes such as the 16S rRNA gene often lack the resolution to reliably capture intra-specific genomic differences22.
To overcome this difficulty and increase the utility of metagenomic datasets, we developed ConStrains (Conspecific Strains), an algorithm that exploits the polymorphism patterns in a set of universal bacterial and archaeal genes to infer strain-level structures in species populations. Using both in silico and previously published host-derived datasets, we show that ConStrains recovers intra-specific strain profiles and phylogeny with high accuracy, and captures important features of community dynamics including dominant strain switches and rare strains. The simulated data sets address performance in the context of different within-population diversities, different numbers of strains, the interference from other species within the same community, as well as the scalability of the method using a large in silico cohort with 322 samples. Predicted within-species structures as well as the strain genotypes were highly accurate across these simulated datasets. Applying this method to an infant gut development metagenomic data set reveals new insights into strain dynamics with functional relevance. ConStrains is implemented in Python, and the source code is available with this paper (Supplementary Code) and freely available together with full documentation at https://bitbucket.org/luo-chengwei/constrains.
RESULTS
The ConStrains algorithm
Guided by reference species, the ConStrains algorithm compares raw metagenomic reads to reference genomes and identifies patterns in SNPs as the basis for differentiation and quantification of conspecific strains. This approach is fundamentally different from other reference-dependent methods such as Sigma and PathoScope16,17 because ConStrains can provide reliable predictions for species with only one genome (complete or draft), whereas those methods rely on the availability of a comprehensive reference strain collection. For confident SNP calling, a species requires a minimum of tenfold coverage (Supplementary Fig. 1) within or across all samples considered, which is obtained for all species with a relative abundance of >1% at typical sequencing depths of 5 Gbp. When applied to multiple samples, for example a longitudinal time series or otherwise related samples, strain identities can be traced across the different samples. The algorithm achieves this in two operations: (1) identifying species for which SNPs are detected and quantified, and (2) transforming individual SNPs into SNP profiles representing individual strains.
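The coverage requirement can be checked with simple arithmetic. The sketch below is illustrative only: the 5 Mbp genome size is an assumed value (it is not stated in the text), and relative abundance is treated as the fraction of sequenced bases belonging to the species.

```python
def expected_coverage(total_bases, relative_abundance, genome_size_bp):
    """Expected fold coverage of one species in a shotgun metagenome."""
    return total_bases * relative_abundance / genome_size_bp


sequencing_depth = 5e9   # 5 Gbp, the typical depth quoted in the text
genome_size = 5e6        # ASSUMPTION: a 5 Mbp bacterial genome, for illustration
coverage_at_1pct = expected_coverage(sequencing_depth, 0.01, genome_size)
print(f"coverage at 1% relative abundance: {coverage_at_1pct:.1f}x")
```

Under these assumptions a species at 1% relative abundance sits right at the tenfold threshold; larger genomes or shallower sequencing would push it below.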
The first operation is a two-step process. Because the algorithm identifies strains only for those species with sufficient sequencing depth (≥10-fold coverage in at least one sample; Supplementary Fig. 1), the first step uses MetaPhlAn1 for rapid species composition profiling. For those species with sufficient sequencing depth, a custom database of marker genes is created from the comprehensive PhyloPhlAn marker set23, against which the raw reads are mapped using Bowtie224. This targeted approach allows for optimized time and computational efficiency. Resulting marker gene alignments are processed with SAMtools25 to generate a table of coverage by base position from which SNPs are identified. It is important to note that in this process the reference sequences are removed and SNPs are identified de novo to minimize reference dependency (Fig. 1a–d and Online Methods). We verified that such a SNP selection procedure is sufficiently accurate and uniquely sensitive to disentangle intra-specific diversity (Supplementary Note 1 and Supplementary Fig. 2).
In the second operation, individual SNPs are combined into unique SNP profiles from which strains are identified. Previous methods have approached the challenge of identifying individual organisms from microbial communities using SNPs (for example, oligotyping26 and minimum entropy decomposition27), but were limited to SNPs within the span of a sequence read length. ConStrains overcomes this read length limitation and represents each strain by a barcode-like string of concatenated SNPs spanning hundreds of genes, referred to as the “uniGcode.” To derive the strains’ uniGcodes within a data set, ConStrains constructs candidate models of strain combinations using a combination of SNP-flow and SNP-type clustering algorithms. Subsequently, the relative abundance of strains in each candidate model across the cohort is estimated using a Metropolis-Hastings Markov Chain Monte Carlo approach (Fig. 1e–g and Online Methods). Finally, to choose the optimal model by balancing model fitness against complexity, the corrected Akaike information criterion (AICc) is employed (Fig. 1h and Online Methods). ConStrains repeats these steps for each species with sufficient coverage, then outputs the number of strains and their respective uniGcodes and relative abundances (Fig. 1i). The uniGcode allows downstream analysis such as cross-sample comparisons and evolutionary studies.
ConStrains identifies strains in large data sets
To validate the performance of ConStrains for strain profiling, we used in silico and host-derived data sets. A total of 36 different sets of k-strain mixtures were generated using in silico genome-based Illumina paired-end read simulation based on ten different Escherichia coli strains whose complete genomes are publicly available, representing real-life scenarios of strain admixtures (k = 2–7; Fig. 2a–b and Supplementary Fig. 3a, Supplementary Table 1). These 36 sets of reads were profiled by ConStrains using default settings. Predicted results were compared with the ‘true’ strain compositions using Jensen-Shannon divergence (JSD; Fig. 2b and Supplementary Fig. 3b). ConStrains successfully predicted the underlying intra-species compositions in all 36 data sets (P < 1 × 10⁻⁵; two-group t-test against random guesses; Fig. 2b), demonstrating a substantial advantage (Supplementary Fig. 4) over reference-based approaches (see Supplementary Note 1 and Supplementary Fig. 5 for details and comparisons). Furthermore, in 34 of the 36 sets of reads (94.44%), the numbers of strains inferred exactly matched the ground truth (Fig. 2a), with the remaining two sets of reads having an additional chimeric strain predicted at an extremely low level (<0.1%). We therefore set the recommended detection limit at 0.1% to reduce such errors computationally. Since this is a relative abundance threshold, one can still target low-abundance organisms by increasing sequencing depth. In similar simulations with up to 30 E. coli strains, ConStrains predicted the strain composition with high confidence when the strain number was less than ten (Fig. 2c), which represents the intra-specific upper bound for more than 95% of metagenomic species (Fig. 2d and Supplementary Note 1).
To assess the impact of intra-species recombination on performance, we used both real sequencing reads from highly recombined Burkholderia pseudomallei strains28 and in silico-simulated reads derived from recombinant strains, and identified no significant adverse impact (Supplementary Note 1). We further tested the performance in a more realistic metagenomic scenario by embedding E. coli strains within communities with various levels of complexity and found that our approach remained robust (Online Methods, Supplementary Note 2, and Supplementary Table 2). We also found no significant correlation between the alpha diversity of the admixture compositions and prediction accuracy. These results collectively suggested good algorithm performance (Supplementary Note 1).
We then tested ConStrains using a host-derived metagenomic data set that had previously been analyzed using a manually curated strain identification approach. Using manual strain curation the authors had for the first time described the changes in an infant gut microbiome during the first 24 days of life4. All three manually curated Staphylococcus epidermidis strains reported in this study were successfully predicted by ConStrains in a fully automated manner, with the predicted relative abundances of each strain over time having highly similar values to the original compositions quantified from the scaffold coverage (JSD avg. = 0.024, s.d. = 0.021; Supplementary Fig. 6). Because the performance of ConStrains’ fully automated approach matched well with the manually curated, semi-automated approach described previously4, but required far less machine and manual resources (ConStrains completed the infant gut data set in 8.5 CPU hours with RAM peak footprint of < 2GB on a Linux server with Xeon 2.6GHz processors, in contrast to days to weeks of manual curation following assembly), we next applied ConStrains to a very large data set for which a manual effort would not be feasible (for detailed resource usage, see Supplementary Note 5 and Supplementary Table 3).
In the absence of such a large data set (especially one where true results were known), we used a simulated shotgun data set with intra-specific structure mimicking the natural relative abundance of taxa, informed by a recent gut microbiome collection effort for which samples were collected daily over the course of one year29 (Fig. 3a, Online Methods, and Supplementary Note 3). ConStrains analyzed 91 species with sufficient depth in the 322 in silico samples. In total, ConStrains analyzed 3.2 terabases of paired-end reads containing 1,361 strains from 320 species, with minimal runtime and infrastructure requirements (Supplementary Note 3). ConStrains achieved high accuracy in individual samples, and also captured crucial information such as dominant strain type changes, for example in Bacteroides fragilis (Fig. 3a–c and inset windows 1–3; see Supplementary Table 4 and Supplementary Note 3 for details). This large cohort also enabled us to test factors that might affect the performance of ConStrains, including population complexity, coverage, and relatedness. We found that 10× coverage was necessary for accurate profiling, and that strain relatedness could also affect performance (Supplementary Fig. 7 and Supplementary Note 3). With this thorough benchmarking, we next applied ConStrains to two previously published clinical data sets to illustrate the biological insights strain-level analyses can provide.
ConStrains reconstructs strain phylogeny
Lieberman and co-workers previously reported on the genetic variation of Burkholderia dolosa in cystic fibrosis patients by combining a selective culturing step with a deep population sequencing approach30. We re-analyzed their data set using our ConStrains algorithm and predicted a total of six B. dolosa strains in the population with an abundance well above 0.1% (pop-I to pop-VI; Fig. 4a). We compared the uniGcodes from the six strains inferred by ConStrains with the isolate genome sequence by building a phylogenetic tree, and found that all of the colony strains and two population strains (pop-I and pop-II) were closely related (Fig. 4a). Moreover, the combined relative abundance of pop-I and pop-II represented the majority of the population (51.3% and 27.9% for pop-I and pop-II, respectively). This finding corroborated observations based on the colony sequencing approach. In addition, the ConStrains algorithm identified four additional, less abundant strains (pop-III to pop-VI). None of these strains could be inferred by the colony sequencing approach alone, likely because of their low abundance (11.2%, 8.1%, 1.0%, and 0.5%, respectively). To validate these additional predictions, we further examined the polymorphism patterns in these four strains, and compared them against pop-I and pop-II. As shown in Fig. 4b, we found patterns that are unlikely to have resulted from chimeric mixtures of SNPs from pop-I and pop-II (P < 0.01, permutation test). This analysis demonstrated that application of ConStrains to cross-sectional datasets, used in addition to a culture-based approach, allows for a comprehensive and efficient discovery of rare strains.
Uncovering strain dynamics in infant gut development
We next analyzed an infant gut development dataset containing 54 samples from 9 subjects collected over the first three years of life (Online Methods and Supplementary Fig. 8) to further explore the ability of ConStrains to reveal strain dynamics. ConStrains analysis was run on a total of 75 species that had sufficient sequencing depth for analysis (10×; Fig. 5). Because previously reported strain detection algorithms were limited to studying the population consensus sequences12, and ConStrains has the capability to untangle intra-species diversity, we first examined the number of strains observed within each species. Nearly all species (94.66%) had more than two strains, with an average of 4.88 strains per subject (±1.54 s.d.; Supplementary Fig. 9). By tracking the uniGcode of each strain in separate individuals, we identified several unique strain-level longitudinal patterns. For instance, the population of Faecalibacterium prausnitzii usually comprised several strains that maintained a co-dominant profile, in which the strains maintained the same order of abundance; Ruminococcus gnavus had highly variable behaviors over time, with different strains dominating the intra-species composition at different time points; in contrast, Bacteroides ovatus contained one strain that remained dominant over time, keeping other strains relatively rare. Bifidobacterium bifidum strains demonstrated dynamic patterns similar to those of F. prausnitzii; moreover, the strains reestablished the same intra-specific diversity even after the abundance of the species dropped below the detection limit (Fig. 5, open boxes). We anticipate that the capability of generating better insights into intra-species dynamics of such health-related species31–33 will shed light on the role of these organisms in human physiology.
With this goal in mind, we pursued our findings in Bifidobacterium longum, an organism linked to human gut health and applied to prevention and treatment of several diseases33. We first observed that the phylogeny of B. longum strains strongly correlated with their host origins (Fig. 5, phylogenetic tree, inset box), which indicated strong individuality of B. longum strains. Moreover, in two subjects (4 and 6, Fig. 6a), we observed switches in dominant strain types that were highly correlated with the overall relative abundance of the B. longum species. As previous work has shown that a single operon can affect the competitiveness of different Bacteroides fragilis strains34, we evaluated functional differences between different dominant strains. In both subjects, the different strains dominating during consecutive phases (period 2 in subject 4 and period 1 for subject 6; Fig. 6a) carried additional functions that might be crucial to B. longum’s successful colonization of the host gut. In particular, presence of the human milk oligosaccharide (HMO) utilization cluster has been shown to result from an adaptation to the human infant gut35 (Fig. 6b; highlight IV). Some additional functions might underlie formation of a B. longum bloom, including the presence of the fructose and L-fucose utilization gene clusters (Fig. 6b; highlights I and III). Together, these findings might explain why strains with these functions were associated with higher relative abundance of B. longum in the infant gut microbiome. We also observed functions specific to strains that were dominant in periods when B. longum was less abundant (periods 1 and 3 in subject 4 and period 2 in subject 6; Fig. 6a), most notably that the capsular polysaccharide biosynthesis genes were absent from dominant strains when B. longum was more abundant (Fig. 6b; highlight II).
Taken together, strain-level insights provided by ConStrains, combined with functional analyses, could offer candidate targets and hypotheses for future studies.
DISCUSSION
We have shown that the ConStrains algorithm accurately predicts strain-level profiles in large cohorts of metagenomic samples, and that the inferred uniGcodes reconstruct strain phylogeny, within or across cohorts, allowing combined cohort studies. ConStrains is scalable and has minimal resource requirements. In contrast, other approaches14,16,17 are largely dependent on prior knowledge of reference strain genomes, with sub-species resolution being directly dependent on the number of available reference strains per species. This greatly limits the application of such methods to real metagenomic data, as for most of the human microbiome species only one reference genome is available14. Current databases are quickly gaining in intra-species genome representation, but are still far from saturating natural diversity. With just one genome per species, ConStrains can resolve natural diversity occurring within that species, and is therefore agnostic to unknown strains. Future improvements for strain-level analysis include identification of strains in the absence of any reference genomes. It is conceivable that combining ConStrains with de novo genome assembly from metagenomic data could be an appropriate candidate to overcome this hurdle.
ConStrains is particularly effective for obtaining insights that were previously overlooked using species level findings (Supplementary Note 4 and Supplementary Figs. 10–12), and will thus enable new types of studies. As shown above with the B. longum example, combining strain-level profiles with reference genome-based gene coverage analysis can reveal candidate genes for understanding strain-specific beneficial effects and the functions that might contribute to successful colonization in the human gut. ConStrains could also identify strains or genes associated with disease and link variable genomic regions to individual strains, a major challenge in shotgun metagenomics. Strain-level profiles, together with appropriate metadata, could link reference-based or de novo assembled genes with individual strains and further interpret unknown strain-specific functions. Our study of the infant gut development cohort captured HMO utilization cluster enrichment shifts in different development periods, which is particularly important because products of the HMO utilization cluster are essential for B. longum to exert its probiotic effects36. Finally, strain phylogeny could be used across cohorts and add metagenomic means to test fundamental ecological hypotheses, including neutral theory and other adaptive and nonadaptive mechanisms for maintaining sympatric diversity among strains. Although we have applied ConStrains to human microbiome datasets, it can also be applied to environmental samples to test fundamental hypotheses about the role of microbes in the environment that can only be addressed at the strain level.
ONLINE METHODS
ConStrains algorithm
Extracting target species and informative SNPs
With raw reads from samples S1, S2, …, Sn, ConStrains starts with profiling input metagenomes using MetaPhlAn1 (v1.7) with default settings, with the exception that alignment options are set to “very-sensitive”; species with average coverage higher than a coverage cutoff (default: 10×) in at least one sample are selected for further strain analysis. For each of the selected species, the corresponding set of the universally conserved genes reported by Segata et al.1 are used as a database, and Bowtie224 mapping with default setting is carried out to map reads against those reference genes. Only reads with proper pairing and orientation, no indels, >30 mapping quality, >90 length mapped (overhanging part at gene 5′ and 3′ ends is clipped off before calculation), and at least 95% nucleotide identity with the reference gene are further used. These reads are then piled up onto their respective reference sequences using SAMtools25, and the reference gene coverage is subsequently calculated on a per-base basis. To filter out genes with spurious mappings due to hypervariable regions or conserved universal motifs, sites with less than default minimum coverage, as well as those that fall outside of the 1.5 interquartile coverage range across the gene length, are masked. Any gene with more than 30% of its length masked is discarded from further analysis. Single nucleotide polymorphism sites (SNPs) are then counted across samples as those unmasked positions where the minor allele has at least two counts or more than 3% in relative abundance.
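The masking and SNP-counting rules in this paragraph can be sketched as follows. This is a minimal illustration, not the ConStrains source: the function names and input layout are hypothetical, the "1.5 interquartile coverage range" is interpreted as Tukey-style fences (Q1 − 1.5·IQR to Q3 + 1.5·IQR), and the minor-allele rule is applied literally as "at least two counts or more than 3%".

```python
from statistics import quantiles


def mask_sites(coverage, min_cov=10, k=1.5):
    """Flag sites below the minimum coverage or outside the assumed IQR fences."""
    q1, _, q3 = quantiles(coverage, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [c < min_cov or c < low or c > high for c in coverage]


def call_snps(counts, min_cov=10, max_masked=0.30,
              min_minor_count=2, min_minor_frac=0.03):
    """counts: one dict (base -> read count) per position of a marker gene.

    Returns the SNP positions, or None when the gene is discarded because
    more than `max_masked` of its length is masked.
    """
    coverage = [sum(site.values()) for site in counts]
    masked = mask_sites(coverage, min_cov)
    if sum(masked) / len(masked) > max_masked:
        return None
    snps = []
    for pos, (site, is_masked) in enumerate(zip(counts, masked)):
        if is_masked:
            continue
        ordered = sorted(site.values(), reverse=True)
        minor = ordered[1] if len(ordered) > 1 else 0
        if minor >= min_minor_count or minor / coverage[pos] > min_minor_frac:
            snps.append(pos)
    return snps


# Toy gene: 20 positions at 20x, one true SNP and one spurious high-coverage site.
gene = [{"A": 20} for _ in range(20)]
gene[5] = {"A": 15, "C": 5}      # minor allele with 5 reads -> SNP
gene[10] = {"A": 120, "T": 80}   # coverage outlier -> masked, not called
snp_positions = call_snps(gene)
```

In this toy gene the outlier at position 10 is masked before SNP calling, so only position 5 is reported.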
Strain typing by SNP-flow algorithm
With SNPs extracted, ConStrains first infers the strain composition and their SNP-types using the “SNP-flow” algorithm in a per-species, per-sample fashion. In this algorithm, all SNP sites are first hierarchically clustered by the Euclidean distance between the frequencies of different alleles, defined as
$$d(a, b) = \sqrt{\sum_{i=1}^{4} (a_i - b_i)^2}$$
where a and b are the frequency vectors of the four bases, sorted in descending order, of the respective SNPs. Clusters that contain fewer than 5% of the overall SNPs or fewer than ten SNPs are discarded. The centroid of each cluster is selected as representative. These SNP cluster centroids (SCCs) are then ranked in descending order based on their weight, quantified as the number of SNPs they represent. Finally, a directed graph is constructed using these SCCs, in which nodes are alleles in these SCCs and each is assigned a “capacity” defined by the allele frequency, and the alleles from neighboring SCCs are connected by edges (Fig. 1e).
In the directed graph constructed in the previous step, nodes from the same SCC are denoted as a layer. With m layers in the graph, SNP-flow explores all possible combinations of paths from the first layer to the last. Every such path represents a strain genotype, and its relative abundance, c, is defined as the lowest node capacity among all nodes on the path. Once a path is visited, the capacity of every node on the path is reduced by the path’s relative abundance c (Fig. 1e). This pathfinding and visiting step is repeated until all nodes’ capacities are zero, and the visited paths constitute one combination. ConStrains exhausts all possible SNP-type (strain) combinations β = {β1, β2, …, βk} in each sample, with the i-th sample’s SNP-type βi = bi1bi2…bih, where bij is one of the four bases A, C, G, and T, and the associated strain profile αi = (αi1, αi2, …, αih) with the relative abundances summing to 1.
For each sample, ConStrains picks the optimal combination that minimizes the fitting error, defined as the discrepancy between expected SNP frequencies and observed frequencies, ε, defined as:
where Eij is the expected frequency of the i-th base at the j-th SNP locale; similarly, Oij is the observed frequency of the i-th base at the j-th SNP locale in the pileup of aligned reads from the corresponding sample. For instance, C is the second base (i = 2), and if we observed two C’s and eight A’s at the fifth SNP locale (j = 5) in the pileup of aligned reads against the reference, the frequency of C is 0.2 at that position and is thus referred to as O25 = 0.2. Eij is inferred using αi and βi such that Eij equals the summed relative abundance of the strains whose SNP-type carries the i-th base at the j-th SNP locale.
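A minimal sketch of the path-visiting step and the fitting error follows. It makes several simplifying assumptions that are not taken from the ConStrains source: only one greedy combination is built (always following the highest remaining capacity in each layer) rather than exhausting all combinations, each layer is treated as a single SNP locale, and the discrepancy ε is taken to be a sum of squared differences. All function names are hypothetical.

```python
BASES = "ACGT"


def snp_flow_greedy(layers, tol=1e-9):
    """Build one path combination from layered allele capacities.

    layers: list of dicts (base -> allele frequency), each summing to 1.
    Returns a list of (genotype, relative_abundance) tuples.
    """
    remaining = [dict(layer) for layer in layers]
    strains = []
    while any(cap > tol for layer in remaining for cap in layer.values()):
        # Follow the node with the largest remaining capacity in every layer.
        path = [max(layer, key=layer.get) for layer in remaining]
        # The path abundance is the lowest capacity along the path.
        c = min(layer[base] for layer, base in zip(remaining, path))
        if c <= tol:
            break
        for layer, base in zip(remaining, path):
            layer[base] -= c
        strains.append(("".join(path), c))
    return strains


def expected_freqs(strains, n_sites):
    """Expected base frequencies: summed abundance of strains carrying each base."""
    expected = [dict.fromkeys(BASES, 0.0) for _ in range(n_sites)]
    for genotype, abundance in strains:
        for j, base in enumerate(genotype):
            expected[j][base] += abundance
    return expected


def fitting_error(expected, observed):
    """ASSUMED form of the discrepancy: sum of squared frequency differences."""
    return sum((e[b] - o.get(b, 0.0)) ** 2
               for e, o in zip(expected, observed) for b in BASES)


layers = [{"A": 0.7, "G": 0.3}, {"C": 0.7, "T": 0.3}]
strains = snp_flow_greedy(layers)
error = fitting_error(expected_freqs(strains, len(layers)), layers)
```

For this two-layer toy input the greedy walk yields genotype AC at 0.7 and GT at 0.3, which reproduces the observed frequencies exactly.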
Inferring strain compositions
To unify these optimal SNP-types into cohort-wide strains, ConStrains next constructs a neighbor-joining tree of the SNP-types from different samples based on sequence percentage identity, and utilizes an internal parameter, Δd, defined as the distance between the tree-cutting point and the leaves, to cut the tree. Rather than using a preset value, the algorithm cuts this tree using all possible Δd. Each internal node created by such a cut could be viewed as the representative of all the children nodes (SNP-types) on the tree. In doing so, it identifies all possible k clusters defined by the structure of the tree of SNP-types (Fig. 1f), which we refer to as candidate strains.
With the proposed k strains from the previous step, in each sample, we need to find a composition, α* = (α*1, α*2, …, α*k), with
$$\sum_{j=1}^{k} \alpha^{*}_{j} = 1, \qquad \alpha^{*}_{j} \ge 0$$
to minimize the discrepancy between expected SNP frequencies and observed frequencies, ε, as defined previously. This is carried out by a Metropolis-Hastings Monte Carlo method. ConStrains first generates a number of seeds (default: 1,000) uniformly at random on the (k−1)-simplex. The top 50 seeds are then selected, and each such seed’s vicinity on the (k−1)-simplex is iteratively searched. In iteration t, a new point, $\alpha^{t}_{ik}$, is picked within a 0.01 radius of the previous point, $\alpha^{t-1}_{ik}$, and it is accepted as the new point with probability $\min(1, q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik}))$, where $q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik}) = \varepsilon(\alpha^{t}_{ik}) / \varepsilon(\alpha^{t-1}_{ik})$. The iteration is repeated until $|1 - q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik})|$ is smaller than 0.001 or the maximum number of iterations (10,000) is reached. The composition yielding the lowest ε is selected as the optimal α*ik. ConStrains repeats this step for all samples and all k, yielding solutions for each k, α*k = (α*1, α*2, …, α*n), with corresponding error εk (Fig. 1g).
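The seeded local search can be sketched as below. This is an illustration under stated assumptions rather than the published implementation: the proposal perturbs each coordinate by up to 0.01 and renormalizes (an approximation of "within the 0.01 radius"), and the acceptance ratio is oriented as previous error over proposed error so that lower-error proposals are always accepted. The function names and the `error_fn` callback are hypothetical.

```python
import random


def random_simplex_point(k, rng):
    """Uniform draw on the (k-1)-simplex via normalized exponential variates."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def perturb(point, radius, rng):
    """Propose a nearby composition and renormalize it onto the simplex."""
    moved = [max(p + rng.uniform(-radius, radius), 0.0) for p in point]
    total = sum(moved)
    return [m / total for m in moved]


def search_composition(error_fn, k, n_seeds=1000, n_top=50, radius=0.01,
                       max_iter=10_000, tol=1e-3, seed=0):
    """Return (best_composition, best_error) for a k-strain model."""
    rng = random.Random(seed)
    seeds = sorted((random_simplex_point(k, rng) for _ in range(n_seeds)),
                   key=error_fn)[:n_top]
    best, best_err = None, float("inf")
    for start in seeds:
        current, err = start, error_fn(start)
        for _ in range(max_iter):
            if err < best_err:
                best, best_err = current, err
            if err == 0.0:
                break
            proposal = perturb(current, radius, rng)
            new_err = error_fn(proposal)
            # ASSUMED orientation: ratio > 1 means the proposal fits better.
            q = err / new_err if new_err > 0 else float("inf")
            if rng.random() < min(1.0, q):
                current, err = proposal, new_err
            if abs(1.0 - q) < tol:
                break
        if err < best_err:
            best, best_err = current, err
    return best, best_err


target = [0.6, 0.3, 0.1]


def toy_error(comp):
    """Squared distance to a known composition, standing in for the SNP error."""
    return sum((c - t) ** 2 for c, t in zip(comp, target))


best, best_err = search_composition(toy_error, k=3, n_top=10, max_iter=500)
```

In practice `error_fn` would compare expected and observed SNP frequencies for a candidate composition; the toy error here only demonstrates that the search stays on the simplex and converges near a known answer.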
Selecting the optimal strain model
Corrected Akaike information criterion (AICc) is employed to select optimal k. The AICc of each k is calculated as:
where L = 1−εk denotes the model likelihood. ConStrains selects the k with the lowest AICc and outputs the associated SNP-types and compositions as final results (Fig. 1h). As noted previously, we suggest filtering strains with less than 0.1% in relative abundance as they present a high probability of being chimeric.
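For orientation, the textbook form of the corrected criterion for a model with p free parameters fitted to n observations is AICc = 2p − 2 ln L + 2p(p + 1)/(n − p − 1). The sketch below applies that textbook form with L = 1 − εk. Treating the number of strains k as the parameter count and passing n as a free argument are assumptions of this example, and the error values are invented for illustration.

```python
import math


def aicc(likelihood, n_params, n_obs):
    """Textbook corrected Akaike information criterion."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    aic = 2 * n_params - 2 * math.log(likelihood)
    return aic + 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)


def select_k(errors, n_obs):
    """errors: dict mapping candidate k to its fitting error (epsilon_k)."""
    # ASSUMPTION: the number of free parameters equals k.
    scores = {k: aicc(1.0 - eps, k, n_obs) for k, eps in errors.items()}
    return min(scores, key=scores.get), scores


# Invented errors: a one-strain model fits poorly, three strains barely beat two.
toy_errors = {1: 0.95, 2: 0.10, 3: 0.09}
best_k, scores = select_k(toy_errors, n_obs=50)
```

With these toy values the two-strain model wins: the third strain lowers the error only marginally, so its complexity penalty outweighs the gain.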
In silico data sets
To simulate in silico single-species data sets, 62 complete E. coli genome sequences were downloaded from NCBI. Ten genomes were selected, and their relatedness is shown by a maximum likelihood tree (Supplementary Fig. 3a) constructed from concatenated nucleotide sequences of core genes among the 10 strains using a method similar to Luo et al.19. For each number of strains (N = 2–7), 1,000 random compositions were sampled from a Gamma distribution with k = 1 and θ = 0.5. Within each set of 1,000 compositions, Shannon entropy was calculated and used to rank the compositions. The compositions at the 15th, 30th, …, 90th percentiles were picked to form a gradient of intra-specific diversity for each N. ART simulator37 was employed to simulate 100× coverage of 100 bp paired-end Illumina reads using these compositions with default settings for Illumina and library settings as “-m 350 -s 50” (Supplementary Fig. 3a). These samples were further grouped together to simulate single strain series samples (Supplementary Table 1).
These simulated E. coli reads were then spiked into in silico-constructed metagenomes to measure the impact from other species. Three human microbiome-like metagenomes with low, medium, and high complexity levels (referred to as LC, MC, and HC, respectively) were simulated based on an aggregated MetaPhlAn1 profile over all 690 Human Microbiome Project (HMP) samples38. E. coli and Shigella were excluded from the profile, and the rest of the species were ranked based on their average abundance in the HMP cohort. The top 20, 50, and 100 most abundant species were selected for LC, MC, and HC, respectively. The species composition in each in silico metagenome was calculated as their relative abundance in the HMP cohort, normalized by their total sum. Genomes of these species were downloaded from NCBI, and a representative strain was selected at random if multiple strains of the same species were present. A total of 100 million 100 bp paired-end Illumina reads were simulated for each set by ART simulator37 with the same settings as mentioned previously. Additional data sets for testing the sensitivity and the performance on different numbers of strains and recombined strains were generated in a similar fashion using ART (see Supplementary Note 1 for details).
The year-long shotgun metagenome cohort with 322 samples was simulated based on donor A’s 16S rRNA amplicon profiles reported in David et al.29. The operational taxonomic unit (OTU) table was used as a guide for community composition in human microbiomes. To allow simulation at the strain level, however, taxonomy in the OTU table was shifted down by one level. For instance, species composition in the original OTU table was shifted to be the strain composition. NCBI draft and complete genomes were used to match as closely as possible the phylogeny of the original OTUs. Reads were then simulated by ART simulator as previously described. The coverage was set to be 1× per 25 read counts in the 16S OTU table.
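The composition sampling described in this paragraph can be sketched as follows. It is a hedged illustration: normalizing the Gamma draws so that each composition sums to 1, the natural-log entropy, and the index-based percentile lookup are assumptions, and the function names are hypothetical. Six percentiles for each of the six values of N would give the 36 mixtures mentioned in the Results.

```python
import math
import random


def sample_composition(n_strains, rng, shape=1.0, scale=0.5):
    """Draw Gamma(k=1, theta=0.5) weights and normalize them (assumed step)."""
    draws = [rng.gammavariate(shape, scale) for _ in range(n_strains)]
    total = sum(draws)
    return [d / total for d in draws]


def shannon_entropy(composition):
    """Shannon entropy in nats; zero-abundance strains contribute nothing."""
    return -sum(p * math.log(p) for p in composition if p > 0)


def diversity_gradient(n_strains, n_draws=1000,
                       percentiles=(15, 30, 45, 60, 75, 90), seed=0):
    """Rank random compositions by entropy and pick one per percentile."""
    rng = random.Random(seed)
    ranked = sorted((sample_composition(n_strains, rng) for _ in range(n_draws)),
                    key=shannon_entropy)
    picks = []
    for pct in percentiles:
        index = min(n_draws - 1, round(pct / 100 * (n_draws - 1)))
        picks.append(ranked[index])
    return picks


gradient = diversity_gradient(n_strains=4)
entropies = [shannon_entropy(c) for c in gradient]
```

The picked compositions run from uneven mixtures (low entropy) to nearly even ones (high entropy), which is the diversity gradient the simulation is designed to cover.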
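The selection and renormalization of background species can be sketched as below. The species names and abundance values are invented placeholders, and the prefix-based exclusion of E. coli and Shigella is an assumption about how the filter might be expressed.

```python
def top_n_composition(mean_abundance, n, exclude_prefixes=()):
    """Keep the n most abundant species and rescale them to sum to 1."""
    kept = {species: abundance
            for species, abundance in mean_abundance.items()
            if not any(species.startswith(p) for p in exclude_prefixes)}
    top = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:n]
    total = sum(abundance for _, abundance in top)
    return {species: abundance / total for species, abundance in top}


# Placeholder cohort-average profile (values are illustrative only).
toy_profile = {
    "Bacteroides vulgatus": 0.20,
    "Escherichia coli": 0.15,
    "Faecalibacterium prausnitzii": 0.10,
    "Shigella sonnei": 0.05,
    "Bacteroides ovatus": 0.05,
    "Ruminococcus gnavus": 0.02,
}
background = top_n_composition(toy_profile, n=3,
                               exclude_prefixes=("Escherichia coli", "Shigella"))
```

Calling the same function with n = 20, 50, and 100 on a real cohort profile would correspond to the LC, MC, and HC settings.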
Biological data sets
The two infant gut development longitudinal metagenomic data sets used in this study were from a previous study4 and from our recent effort in tracking nine subjects over a three-year period since birth. For the former set, all metagenomic samples were downloaded from NCBI SRA under accession number SRA052203, and the corresponding assembled Staphylococcus epidermidis strains and phage genomes were downloaded from ggKBase as described by Sharon et al.4. For the latter set, 54 stool samples were collected from nine infant subjects between September 2008 and August 2010 in Finland. Samples were first collected by the subjects’ parents and stored in the household freezer before being transferred on dry ice to a laboratory −80 °C freezer. Samples were then shipped to the Broad Institute for DNA extraction, in which QIAamp DNA Stool Mini Kit (Qiagen, Inc., Valencia, CA, USA) was used as described previously39. Library construction was carried out following Human Microbiome Project’s standard protocol (http://hmpdacc.org/tools_protocols/tools_protocols.php), and 101 bp paired-end reads were produced on an Illumina HiSeq 2000 platform. The raw sequences of these samples are available at SRA under BioProject accession number PRJNA269305, and the corresponding sample information is available in Supplementary Table 5.
Prediction accuracy measurement
To measure how close the predicted composition, P, is to the true composition, Q, we applied the Jensen-Shannon divergence with minor modifications. Because P and Q may have different dimensions, we first padded the lower-dimensional one with zeros to match the higher-dimensional one, and then defined a composition M based on the sorted P and Q, denoted P′ and Q′, as:
$$M = \frac{1}{2}\left(P' + Q'\right)$$
Therefore the Jensen-Shannon divergence is:
$$\mathrm{JSD}(P' \,\|\, Q') = \frac{1}{2} D(P' \,\|\, M) + \frac{1}{2} D(Q' \,\|\, M)$$
where D(X||Y) is the Kullback-Leibler divergence defined as:
$$D(X \,\|\, Y) = \sum_i X(i) \log \frac{X(i)}{Y(i)}$$
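As a rough illustration of the padding, sorting, and divergence steps described above, the following Python sketch computes the modified divergence for two compositions of possibly different lengths. It assumes that M is the element-wise average of the sorted, padded compositions (the standard Jensen-Shannon midpoint), that sorting is in descending order, and that natural logarithms are used; the function names are hypothetical.

```python
import math

def kl_divergence(x, y):
    """Kullback-Leibler divergence D(X||Y); terms with X(i) = 0 contribute 0."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

def modified_jsd(p, q):
    """Jensen-Shannon divergence between zero-padded, sorted compositions."""
    n = max(len(p), len(q))
    # Pad the shorter composition with zeros, then sort both (descending)
    p_sorted = sorted(list(p) + [0.0] * (n - len(p)), reverse=True)
    q_sorted = sorted(list(q) + [0.0] * (n - len(q)), reverse=True)
    # Midpoint composition M (assumed to be the element-wise average)
    m = [(a + b) / 2 for a, b in zip(p_sorted, q_sorted)]
    return 0.5 * kl_divergence(p_sorted, m) + 0.5 * kl_divergence(q_sorted, m)

# Predicted one strain, truth has two equally abundant strains
example = modified_jsd([1.0], [0.5, 0.5])
```

Because both compositions are sorted before comparison, the measure compares abundance profiles by rank rather than by strain label, so a prediction with the right abundances scores zero regardless of the order in which strains are listed.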
We calculate the SNP typing accuracy as the distance between the inferred SNP tree of strains, Tp, and the true strain tree constructed from concatenated core genes, Tq. First, a distance similar to the symmetric difference introduced by Robinson and Foulds is applied to calculate the distance, d, between these two trees. We then normalize d by the expected basal distance for a random tree with the same leaves, computed as the mean distance between Tq and 1,000 randomly generated trees with the same leaves.
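The normalization described above can be sketched in Python as follows. This is only an approximation under stated assumptions: trees are represented as rooted nested tuples of leaf names, the distance is a clade-based symmetric difference (a rooted analogue of the Robinson-Foulds distance, not necessarily the exact variant used here), both trees share the same leaf set, and random trees are built by repeatedly joining two randomly chosen subtrees, which does not sample topologies uniformly. All function names are hypothetical.

```python
import random

def clades(tree):
    """Set of non-trivial clades (frozensets of leaf names) of a rooted tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    root = walk(tree)
    found.discard(root)  # the full leaf set is shared by every tree
    return found

def symmetric_difference_distance(t1, t2):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(t1) ^ clades(t2))

def leaf_names(tree):
    """Flat list of leaf names in a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return [name for child in tree for name in leaf_names(child)]

def random_tree(leaves, rng):
    """Random binary tree made by repeatedly joining two random subtrees."""
    nodes = list(leaves)
    while len(nodes) > 1:
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((a, b))
    return nodes[0]

def normalized_tree_distance(t_pred, t_true, n_random=1000, seed=0):
    """Distance d between trees divided by the mean distance to random trees."""
    rng = random.Random(seed)
    leaves = leaf_names(t_true)
    d = symmetric_difference_distance(t_pred, t_true)
    baseline = sum(
        symmetric_difference_distance(random_tree(leaves, rng), t_true)
        for _ in range(n_random)
    ) / n_random
    return d / baseline if baseline > 0 else 0.0

t_true = ((("A", "B"), "C"), ("D", "E"))
t_pred = ((("A", "C"), "B"), ("D", "E"))
score = normalized_tree_distance(t_pred, t_true)
```

A normalized value of 0 indicates identical topologies, while values near 1 indicate that the inferred tree is no closer to the true tree than a random tree on the same leaves would be.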
Acknowledgments
We thank Natalia Nedelsky for editorial support. This work was supported in part by the Crohn’s and Colitis Foundation of America, the Leona M. and Harry B. Helmsley Charitable Trust, the National Institutes of Health (NIH) grants U54 DK102557 (R.J.X.) and R01 DK092405 (R.J.X.), and the Howard Hughes Medical Institute (R.K.).
Footnotes
Author Contributions
C.L. and D.G. conceived the project; C.L. designed and implemented the algorithm; C.L., D.G., and R.J.X. designed the experiments; and C.L. performed the analysis. M.K., H.S., D.G., and R.J.X. collected and sequenced the samples. C.L., R.K., R.J.X., and D.G. wrote the paper.
Competing Financial Interests
The authors declare no competing financial interests.
References
- 1. Segata N, et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9:811–814. doi: 10.1038/nmeth.2066.
- 2. Sunagawa S, et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods. 2013;10:1196–1199. doi: 10.1038/nmeth.2693.
- 3. Darling AE, et al. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ. 2014;2:e243. doi: 10.7717/peerj.243.
- 4. Sharon I, et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 2013;23:111–120. doi: 10.1101/gr.142315.112.
- 5. Nielsen HB, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol. 2014;32:822–828. doi: 10.1038/nbt.2939.
- 6. Imelfort M, et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ. 2014;2:e603. doi: 10.7717/peerj.603.
- 7. Luo C, et al. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc Natl Acad Sci USA. 2011;108:7200–7205. doi: 10.1073/pnas.1015622108.
- 8. Kashtan N, et al. Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science. 2014;344:416–420. doi: 10.1126/science.1248575.
- 9. Faith JJ, et al. The long-term stability of the human gut microbiota. Science. 2013;341:1237439. doi: 10.1126/science.1237439.
- 10. Maslunka C, Gifford B, Tucci J, Gurtler V, Seviour RJ. Insertions or deletions (Indels) in the rrn 16S–23S rRNA gene internal transcribed spacer region (ITS) compromise the typing and identification of strains within the Acinetobacter calcoaceticus-baumannii (Acb) complex and closely related members. PLoS ONE. 2014;9:e105390. doi: 10.1371/journal.pone.0105390.
- 11. Han D, et al. Population structure of clinical Vibrio parahaemolyticus from 17 coastal countries, determined through multilocus sequence analysis. PLoS ONE. 2014;9:e107371. doi: 10.1371/journal.pone.0107371.
- 12. Schloissnig S, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493:45–50. doi: 10.1038/nature11711.
- 13. Beitel CW, et al. Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ. 2014;2:e415. doi: 10.7717/peerj.415.
- 14. Greenblum S, Carr R, Borenstein E. Extensive Strain-Level Copy-Number Variation across Human Gut Microbiome Species. Cell. 2015;160:583–594. doi: 10.1016/j.cell.2014.12.038.
- 15. Karlsson E, et al. Eight new genomes and synthetic controls increase the accessibility of rapid melt-MAMA SNP typing of Coxiella burnetii. PLoS ONE. 2014;9:e85417. doi: 10.1371/journal.pone.0085417.
- 16. Hong C, et al. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome. 2014;2:33. doi: 10.1186/2049-2618-2-33.
- 17. Ahn TH, Chai J, Pan C. Sigma: strain-level inference of genomes from metagenomic analysis for biosurveillance. Bioinformatics. 2015;31:170–177. doi: 10.1093/bioinformatics/btu641.
- 18. Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95:315–327. doi: 10.1016/j.ygeno.2010.03.001.
- 19. Luo C, Tsementzi D, Kyrpides NC, Konstantinidis KT. Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 2012;6:898–901. doi: 10.1038/ismej.2011.147.
- 20. Nijkamp JF, Pop M, Reinders MJ, de Ridder D. Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics. 2013;29:2826–2834. doi: 10.1093/bioinformatics/btt502.
- 21. Lasken RS, McLean JS. Recent advances in genomic DNA sequencing of microbial species from single cells. Nat Rev Genet. 2014;15:577–584. doi: 10.1038/nrg3785.
- 22. Ivanova N, et al. Genome sequence of Bacillus cereus and comparative analysis with Bacillus anthracis. Nature. 2003;423:87–91. doi: 10.1038/nature01582.
- 23. Segata N, Bornigen D, Morgan XC, Huttenhower C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat Commun. 2013;4:2304. doi: 10.1038/ncomms3304.
- 24. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
- 25. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 26. Eren AM, et al. Oligotyping: Differentiating between closely related microbial taxa using 16S rRNA gene data. Methods Ecol Evol. 2013;4:1111–1119. doi: 10.1111/2041-210X.12114.
- 27. Eren AM, et al. Minimum entropy decomposition: Unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. ISME J. 2014;9:968–979. doi: 10.1038/ismej.2014.195.
- 28. Nandi T, et al. Burkholderia pseudomallei sequencing identifies genomic clades with distinct recombination, accessory, and epigenetic profiles. Genome Res. 2015;25:129–141. doi: 10.1101/gr.177543.114.
- 29. David LA, et al. Host lifestyle affects human microbiota on daily timescales. Genome Biol. 2014;15:R89. doi: 10.1186/gb-2014-15-7-r89.
- 30. Lieberman TD, et al. Genetic variation of a bacterial pathogen within individuals with cystic fibrosis provides a record of selective pressures. Nat Genet. 2014;46:82–87. doi: 10.1038/ng.2848.
- 31. Sokol H, et al. Faecalibacterium prausnitzii is an anti-inflammatory commensal bacterium identified by gut microbiota analysis of Crohn disease patients. Proc Natl Acad Sci USA. 2008;105:16731–16736. doi: 10.1073/pnas.0804812105.
- 32. Crost EH, et al. Utilisation of mucin glycans by the human gut symbiont Ruminococcus gnavus is strain-dependent. PLoS ONE. 2013;8:e76341. doi: 10.1371/journal.pone.0076341.
- 33. Di Gioia D, Aloisio I, Mazzola G, Biavati B. Bifidobacteria: their impact on gut microbiota composition and their applications as probiotics in infants. Appl Microbiol Biotechnol. 2014;98:563–577. doi: 10.1007/s00253-013-5405-9.
- 34. Lee SM, et al. Bacterial colonization factors control specificity and stability of the gut microbiota. Nature. 2013;501:426–429. doi: 10.1038/nature12447.
- 35. Schell MA, et al. The genome sequence of Bifidobacterium longum reflects its adaptation to the human gastrointestinal tract. Proc Natl Acad Sci USA. 2002;99:14422–14427. doi: 10.1073/pnas.212527599.
- 36. Sela DA, et al. The genome sequence of Bifidobacterium longum subsp infantis reveals adaptations for milk utilization within the infant microbiome. Proc Natl Acad Sci USA. 2008;105:18964–18969. doi: 10.1073/pnas.0809584105.
- 37. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28:593–594. doi: 10.1093/bioinformatics/btr708.
- 38. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207–214. doi: 10.1038/nature11234.
- 39. Morgan XC, et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012;13:R79. doi: 10.1186/gb-2012-13-9-r79.