metaVaR: Introducing metavariant species models for reference-free metagenomic-based population genomics

Romuald Laso-Jadart; Christophe Ambroise; Pierre Peterlongo; Mohammed-Amin Madoui

doi:10.1371/journal.pone.0244637

. 2020 Dec 30;15(12):e0244637. doi: 10.1371/journal.pone.0244637

metaVaR: Introducing metavariant species models for reference-free metagenomic-based population genomics

Romuald Laso-Jadart ¹, Christophe Ambroise ², Pierre Peterlongo ³, Mohammed-Amin Madoui ^1,^*

Editor: Francisco Rodriguez-Valera⁴

PMCID: PMC7773188 PMID: 33378381

Abstract

The availability of large metagenomic data offers great opportunities for the population genomic analysis of uncultured organisms, which represent a large part of the unexplored biosphere and play a key ecological role. However, the majority of these organisms lack a reference genome or transcriptome, which constitutes a technical obstacle for classical population genomic analyses. We introduce the metavariant species (MVS) model, in which a species is represented only by intra-species nucleotide polymorphism. We designed a method combining reference-free variant calling, multiple density-based clustering and maximum-weighted independent set algorithms to cluster intra-species variants into MVSs directly from multisample metagenomic raw reads without a reference genome or read assembly. The frequencies of the MVS variants are then used to compute population genomic statistics such as F_ST, in order to estimate genomic differentiation between populations and to identify loci under natural selection. The MVS construction was tested on simulated and real metagenomic data. MVSs showed the required quality for robust population genomics and allowed an accurate estimation of genomic differentiation (ΔF_ST < 0.0001 and <0.03 on simulated and real data respectively). Loci predicted under natural selection on real data were all detected by MVSs. MVSs represent a new paradigm that may simplify and enhance holistic approaches for population genomics and the evolution of microorganisms.

Introduction

Thanks to advances in deep sequencing and metagenomics, microorganism genomic resources have become more widely available over the last two decades. By analyzing community assemblies containing a large number of uncultured species [1] we are gaining a better understanding of microbial ecology. This is especially the case for marine, soil and gut microbiomes that have been intensively investigated thanks to large sequencing consortia like Tara Oceans [2, 3], TerraGenome [4] or MetaHit [5].

Currently, in order to make effective use of whole-genome metagenomic data when addressing questions of molecular evolution and population genomics in uncultured species, reference genome or transcriptome sequences are required and a priori selected for the target species. Typically, metagenomic reads are first aligned on reference sequences for variant calling, then population alleles or amino acid frequencies [6] are obtained. Derived population genomic metrics are then computed and used to identify genomic differentiation between populations or natural selection of variants that drive genomes evolution. This approach has in particular been applied to gut microbiomes of vertebrates [7, 8] and invertebrates [9], marine bacteria [6, 10] and crustaceans [11].

In metagenomics the filtering step in which reads are aligned is critical in order to avoid the cross-mapping of reads from a given species to the reference genome of another species. The main filters are, first, the selection of genomic regions with a depth of coverage within an expected range specific to the species abundance, and, second, the minimum identity percentage of a read aligned to the reference for membership of the targeted species [11, 12].

This alignment-based approach is currently limited by the number of available reference sequences. To increase the number of references for organisms found in environmental samples, a number of methods have been developed to produce metagenome assembled species (MAGs) from whole-genome metagenomic sequencing. These approaches have successfully been applied to prokaryotes [13–15], but the main limitation of the alignment-based approach remains, namely that it is dependent on the availability, completeness and quality of reference genomes that can be constructed from metagenomic data [14]. Recently the approach has been able to make use of long-read sequencing [16, 17]. However, in the case of eukaryotes, because of the large genome size and the difficulties in obtaining high molecular weight DNA, results for eukaryotes are still lacking.

To bypass the use of references in the variant calling process, several reference-free variant calling methods have been developed. Among them, we can distinguish complete reference-free approaches as implemented in softwares like ebwt2SNP [18], kmer2SNP [19] and DiscoSNP++ [20] to partial reference-free approaches like Kevlar [21] or scalpel [22]. These latter approaches are based on micro-assembly that generates contigs but still need a reference. None of these methods were specifically designed to deal with metagenomic data.

To identify nucleotide variants from metagenomic data, the use of DiscoSNP++ has recently been proposed. DiscoSNP++ detects variants by identifying bubbles in a de Bruijn graph built directly from the raw metagenomic reads. The variants can then be relocated on genomes of interest, if available. In comparison to alignment-based variant calling applied on metagenomic data, DiscoSNP++ has been shown to be less sensitive but more specific in term of recall, and more accurate in term of allele frequency, especially in non-coding regions [23]. Given the sensitivity of population genomic analyses to the accuracy of the allele frequencies, we might consider the use of DiscoSNP++ preferable to the alignment-based approach for population genomics based on metagenomic data. Moreover, the indexing method implemented in DiscoSNP++ uses bloom filters, a space-efficient probabilistic data structure. This enables the indexing of very large and complex dataset and makes DiscoSNP++ a very suitable tool to handle metagenomic data.

To bypass the lack of reference for micro-organisms in order to perform population genomics using metagenomic data, we first introduce the notion of metavariants, i.e variants detected directly from raw metagenomic reads without a reference genome. Then, we present an approach for clustering metavariants by species when reference genomes are not available. We propose the resulting clusters as a new form of species representation that we call metavariant species or MVS. We establish a formal definition of the metavariant and MVS. We implemented the clustering method in an R package called metaVaR. metaVaR allows the construction of the MVSs and also their manipulation to perform population genomic analyses. The clustering of metavariants implemented in metaVaR was benchmark with state-of-the-art clustering algorithms. We also tested the relevance of MVSs using simulated and real metagenomic data to perform accurate population genomic analyses.

Methods

From metavariants to metavariant species

Variable loci and metavariants

We define a metavariant as a single nucleotide variant detected directly from metagenomic data without a reference genome (Fig 1). We use metavariants produced by DiscoSNP++ and consider only metavariants located on loci producing one metavariant. Due to the absence of a reference genome, the reference (a) and alternative (b) nucleotides are chosen by DiscoSNP++ based on alphabetic order. In a single sample, a and b, are characterized by the count of reads supporting them, and a locus l that harbors a metavariant can be represented by its depth of coverage c as the sum of reads supporting a and b. Each locus l is described by the m sample supporting counts l = {c₁, …, c_m}. The n metavariant loci row vectors l_i generated from m samples metagenomic data are placed in the n * m depth of coverage matrix, $L = {l_{i}} = (c_{i j}) \in N^{n * m}, i \in {1, \dots, n}, j \in {1, \dots, m}$ . At this step, it is important to note that DiscoSNP++ generates a fasta file of the metavariants that can be possibly mapped on a given reference.

Clustering of metavariants into metavariant species

A metavariant species or MVS corresponds to a set of intra-species metavariants of the same species. If it is assumed that L contains both inter and intra-species metavariants, MVSs can be represented by pairwise disjoint subsets of L not covering L.

As for the binning of metagenomic contigs [24], we consider that the depth of coverage of the variable loci of the same species covariates across samples and that this constitutes a species signature. MVSs can thus be identified by clustering L based on its values. However, the complexity of metagenomic data raises several issues. First, the number of species and corresponding MVSs is unknown. Second, the initial set of metavariants contains an admixture of inter and intra-species metavariants deriving from entire genomes including repeated regions. Only intra-species metavariants from single copy loci are informative for population genomics. Third, the genome size and the polymorphism rate vary considerably between species. This impacts the depth of coverage of the loci and the number of variants by species.

Metavariant species construction steps and algorithms

To create MVSs, we propose the following approach, described in detail in the following sections:

Reference-free metavariant calling with DiscoSNP++ from raw metagenomic data (Fig 2A).
Metavariant filtering and construction of matrices for the depths of coverage of the different loci and for metavariant frequencies (Fig 2A).
Multiple density-based clustering (mDBSCAN) of the metavariants. Each clustering generates a set of disjoint metavariant clusters (mvc) (Fig 2B).
Each mvc is scored according to its size and the expected depth of coverage distribution of its loci (Fig 2C).
A maximum-weighted independent set (WMIN) algorithm is applied on all mvc to select a subset of mvc as potential MVSs (Fig 2D).
Selection of the metavariant loci based on coverage expectation for robust population genomic analysis (Fig 2E).
Computation of population genomic metrics for each MVS (Fig 2F).

Fig 2 — A. Variant calling from raw metagenomic data. B. Multiple density-based clustering of metavariants *mDBSCAN-WMIN*. Black points represent inter-species variants, while other colored points refer to unclustered variants. Circle colors represent the dbscan parameters. C. Metavariant cluster scoring. D. Maximum-Weighted Independent Set algorithm. Each node is a cluster of metavariants, their circle color representing the dbscan parameters used to build the cluster. Grey zones represent the connected components. Colored nodes are MWIS and black nodes are MWIS neighbors. E. Metavariant filtering for MVS construction. Grey zones correspond to the single-copy loci. F. Population genomics of MVS.

Multiple density-based clustering of metavariants

To cluster L, we used density-based clustering (dbscan) [25]. This clustering algorithm requires two input parameters: epsilon (ϵ) and minimum points (p), corresponding respectively to the minimum euclidean distance between two points to be considered as members of the same cluster and the minimum number of points to extend a cluster. Given ϵ and p, dbscan generates a disjoint set of metavariant clusters {mvc_ϵ,p ⊂ L}. Intuitively, for L generated from real data, there might be no optimal (ϵ, p) enabling the best reconstruction of the clusters according to some criterion. Instead of choosing an arbitrary couple (ϵ, p), we run dbscan using a grid of (ϵ, p) values. This multiple density-based clustering (mDBSCAN) produces a set of possibly overlapping mvc. We call this set MVC, which by default is restricted to mvc containing more than 1,000 metavariants.

Scoring of metavariant clusters

Several metagenomic binning approaches use Gaussian or Poisson distributions to model genome sequencing coverage [13]. However, due to its overdispersion, the sequencing depth of coverage can be better approximated using a negative binomial (NB) distribution [26]. Let mvc_ϵ,p,k ∈ MVC denote the k^th mvc computed with parameters (ϵ, p). For each mvc_ϵ,p,k, in each sample, we compare the observed and expected NB coverage distribution of the loci using fitdistrplus [27]. For all possible mvc_ϵ,p,k we compute d_ϵ,p,k, the mean across m samples of the log-likelihood of the fitting with θ_j, the negative binomial distribution parameters in sample j.

\begin{matrix} d_{ϵ, p, k} = \frac{1}{m} \sum_{j = 1}^{m} ln L (θ_{j}, m v c_{ϵ, p, k} [j]) \end{matrix}

(1)

with mvc_ϵ,p,k[j] being the depth of coverage of the metavariant in the j^th sample in mvc_ϵ,p,k. We also compute ${\bar{d}}_{ϵ, p, k} \in [0, 1]$ , the corrected mean log-likelihood of each cluster,

\begin{matrix} {\bar{d}}_{ϵ, p, k} = \frac{d_{ϵ, p, k} - d_{m i n}}{d_{m a x} - d_{m i n}}, \end{matrix}

(2)

with d_min and d_max being respectively the smallest and the highest mean log-likelihood observed over all mvc ∈ MVC. We also normalised the size s_ϵ,p,k of each cluster such as ${\bar{s}}_{ϵ, p, k} \in [0, 1]$ ,

\begin{matrix} {\bar{s}}_{ϵ, p, k} = \frac{s_{ϵ, p, k} - s_{m i n}}{s_{m a x} - s_{m i n}}, \end{matrix}

(3)

with s_min and s_max being respectively the smallest and the highest sizes of all computed mvc. Finally we compute w_ϵ,p,k, the mvc_ϵ,p,k score as the geometric mean between (2) and (3),

\begin{matrix} w_{ϵ, p, k} = \sqrt{{\bar{d}}_{ϵ, p, k} . {\bar{s}}_{ϵ, p, k}} \end{matrix}

(4)

Metavariant species as a maximum-weighted independent set

Identifying the MVS is equivalent to simultaneously maximizing the number of non-overlapping mvc and their corresponding weights. This corresponds to a Maximum-Weighted Independent Set (MWIS) problem. Algorithms for this have been proposed by Sakai and colleagues [28], (Supplementary Methods 1 in S1 File). Here, we use the WMIN algorithm to find MWIS, the set of all MWISs.

In this context, MVC can be represented by a weighted undirected graph G(V, E, W), where ∀i, j ∈ {1, …, |V|}, v_i ∈ V represents mvc_i of weight w_i ∈ W and e_ij ∈ E ⇔ mvc_i ∩ mvc_j ≠ ∅ and mvc_i ≠ mvc_j. We recall here the outline of the Sakai WMIN algorithm, which takes G as input and iterates until G = ∅. At each iteration the following steps are performed: (i) detection of the connected components (cc); (ii) in each cc, finding the node that is the maximum-weighted independent set, v_i = mwis ∈ MWIS, if f(v_i) = argmax(f), with $f (v_{i}) = \frac{w_{i}}{d e g (v_{i}) + 1}$ and with deg(v_i) the v_i degree; (iii) in each cc, deletion the neighbors of the mwis from G, and storage of mwis in MWIS and deletion mwis. The fact that this algorithm needs $w_{i} \in R^{*}$ justifies (2) and (3).

Selection of metavariant clusters as metavariant species

A metavariant cluster is a potential MVS if it satisfies four criteria applied in order as follows. (i) The metavariant cluster is a maximum-weighted independent set. (ii) The metavariant cluster occurs in more than k_min populations (set to 4 by default) and corresponding loci have a median depth of coverage higher than c_m (set to 8 by default). (iii) The metavariant cluster’s filtered variable loci have a depth of coverage within c_min = c_m − 2 * sd, c_min ≥ 8 by default and c_max ≤ c_m + 2 * sd in all samples. (iv) The metavariant cluster contains more than m_min2 metavariants (set to 100 by default). More formally,

\begin{matrix} \begin{matrix} m v c = m v s \in M V S \equiv m v c \in M W I S \\ \land ∣ m v c ∣ \geq m_{m i n 1} \\ \land \forall j \in {1, . . ., k'}, k' \geq k_{m i n}, m e d i a n_{i \in [1, k]} (c_{i, j}) \geq c_{m} \\ \land \forall m_{i} \in m v c \equiv c_{j} \in [c_{m i n}, c_{m a x}], ∣ {m_{j}} ∣ \geq m_{m i n 2} \end{matrix} \end{matrix}

(5)

MVS-based population genomic analysis

The allele frequency of metavariant species was defined as $p = \frac{c_{a}}{c_{a} + c_{b}}$ , with c_a and c_b the number of read supporting the alleles a and b. The allele frequencies are used to compute classical population genomic metrics. This includes the global F_ST [29] such as $F_{S T} = \frac{\bar{p}}{\bar{p} (1 - \bar{p})}$ , with $\bar{p}$ being the mean allele frequency across all samples. We also computed the LK, a normalized F_ST, such as $L K = \frac{n - 1}{{\bar{F}}_{S T}} . F_{S T}$ , with ${\bar{F}}_{S T}$ being the mean F_ST across all loci. LK is expected to follow a χ² distribution when a large majority of the polymorphic loci are under neutral evolution [29]. To estimate the genomic differentiation between MVS populations, we compute the pairwise-F_ST between the different populations.

Implementation of the mDBSCAN-WMIN algorithm in metaVaR

The metavariants preprocessing step is performed by running metaVarFilter.pl, which produces the depth of coverage and frequency matrices from a reference-free vcf file. The MetaVaR package was written in R and provides three main functions for constructing MVSs:

tryParam creates metavariant clusters using several e, p values. We used the R package fitdistrplus to obtain the log-likelihood of the coverage distribution.
getMWIS identify the maximum-weighted independent sets.
mvc2mvs applies filters described in (5) to select the MVSs and performs population genomic analysis.

The metaVaR source code and manual are available at https://github.com/madoui/metaVaR

Metavariants from simulated metagenomic data

We downloaded six bacterial genomes from NCBI (Escherichia coli NC_000913.3, Pseudomonas aeruginosa NC_002516.2, Yersinia pestis NC_003143.1, Rhizobium tropici NC_020059.1, Rhizobium phaseoli NZ_CP013532.1, Rhodobacter capsulatus NC_014034.1). For each genome we created a derived genome where 1% of the genomic sites corresponds to randomly distributed SNPs for E. coli and 2% for the other bacteria. We used metaSim [30] to simulate Illumina paired-end 100 bp reads from 300bp genomic fragments on seven communities that contained different abundances (Fig 3A) and different proportions of original and derived genomes (Fig 3B).

Fig 3 — A. Variable loci genomic coverage distribution of the species. B. Within-species variants frequencies distribution.

To generate the metavariants, DiscoSNP++ was run on the total read set with parameter -b 1. As a control, the metavariants were relocated on the six original genomes using the -G option. From the VCF file produced by DiscoSNP++, the depth of coverage of the biallelic loci and allele frequencies were calculated using metaVarFilter.pl with parameters -a 10 -b 500 -c 7. Here, the first two parameters are the minimum and maximum cumulative depths of coverage of a locus, and the third parameter specifies that a locus is kept only if it occurs in at least seven samples.

Metavariants from real metagenomic data

To test the performance of the mDBSCAN-WMIN algorithm on real metagenomic data, we used metagenomic reads from five marine samples collected in the Mediterranean (Table 1). In a previous study, the reads were processed by DiscoSNP++ and the metavariants were aligned on the Oithona nana genome. Then, the genomic differentiation between samples was estimated by pairwise-F_ST and loci under selection were identified [23].

Table 1. Metavariant clustering performances on simulated metagenomic data.

Algorithm	Software	Recall	Precision	Signal-To-Noise	Purity	Entropy	Time CPU (min)^*
mDBSCAN-WMIN	metaVaR	0.5523	0.9996	1691.18	0.9999	0.008	3.96
CLR-kmeans	coseq	0.5945	0.9941	99.79	0.993	0.22	1.66
arcsin-GMM	coseq	0.3007	0.998	158.24	0.9915	1.77	4.41
logit-GMM	coseq	0.2685	0.9993	408.6	0.9892	1.76	5.94

Open in a new tab

*The computation was performed on Intel(R) Core(TM) i5-7200U CPU.

In the present study, DiscoSNP++ was run on the five read sets and the vcf output was filtered using metaVarFilter.pl with parameters -c 20 -b 250 -c 3. This produced two files containing the depth of coverage of metavariant loci and the metavariant frequency matrices. These two files were then used as input for metaVaR and other algorithms used for benchmarking (see next section for details).

As a control, the metavariants belonging to O. nana were identified by mapping the metavariants back onto the O. nana genome. The MWIS corresponding to O. nana was used to compare the genomic differentiation (pairwise-F_ST) estimated by metaVaR to the expected pairwise-F_ST values computed by the reference-based approach.

In O. nana, loci with LK values that were higher than expected (based on the χ² distribution) were considered under selection for p-value≤ 0.05.

Comparison of mDBSCAN-WMIN to other sequence abundance-based clustering algorithms

The depth of coverage matrix was used for metavariant clustering by mDBSCAN-WMIN using parameters e = (3, 4, 5, 6, 7) and p = (5, 8, 10, 12, 15, 20). There are no clustering algorithms explicitly developed for metavariant clustering. It is though possible to solve this clustering problem using other sequence abundance-based clustering algorithms, such as those used in RNA-seq data analysis to identify co-expressed genes. We tested state-of-the-art clustering algorithms as follows using coseq [31]: (i) centered log-ratio transform and k-means clustering (with k values from 2 to 12); (ii) arcsin transform and Gaussian Mixture Model with the same k values as in (i); (iii) logit transform and GMM with the same k values as in (i).

To evaluate the performances of each clustering algorithm on simulated data, the clusters were first assigned to one single original genome. It was based on the highest proportion of metavariants originating from the same genome, using the results of the alignment of the metavariants performed with bwa mem with default parameters.

For a cluster, we defined the true positives TP as the number of metavariants of the cluster deriving from the original genome, the false positives FP as the number of metavariants of the cluster not deriving from the original genome, the false negatives FN as the number of metavariants of the original genome not present in the cluster. We computed the recall, precision and signal to noise (STN) of each cluster, as follows: $r e c a l l = \frac{T P}{T P + F N}$ , $p r e c i s i o n = \frac{T P}{T P + F P}$ , $S T N = \frac{r e c a l l}{1 - p r e c i s i o n}$ , with TP, the number true positives, FN the number of false negatives and FP the number of false positives.

The global purity and entropy of the clustering was computed as follows: $P u r i t y = \frac{1}{n} \sum_{q = 1}^{k} {max}_{1 \leq j \leq l} n_{q}^{j}$ with n the number of metavariants, $n_{q}^{j}$ the number of metavariants in cluster q belonging to original species j, and $E n t r o p y = - \sum_{q = 1}^{k} \sum_{j = 1}^{l} \frac{n_{q}^{j}}{n_{q}} {log}_{2} \frac{n_{q}^{j}}{n_{q}}$ , where n is the total number of metavariants, n_q the total number of samples in cluster q, and $n_{q}^{j}$ the number of samples in cluster q belonging to the original species j.

To evaluate the accuracy of the (pairwise-F_ST) computed from the simulated data by metaVaR, we calculated the difference between all possible (pairwise-F_ST) for each bacteria computed by metaVaR to the expected pairwise-F_ST obtained with the alignment-based approach according to the pattern presented on Fig 3B where, for each bacteria, the abundance of the derived genome increases continuously from sample metaG1 to sample metaG7.

To test the sample size effect on the metaVaR performances, we sampled all possible combinations of three to six metagenomes from the seven initially simulated metagenomes. Then, we ran metaVaR using the following grid parameters e = (3, 4, 5, 6, 7) and p = (5, 8, 10, 12, 15, 20). For each run, we calculated the purity, entropy, precision and recall as previously described.

Results

Metavariant species as a new modelling of organisms from metagenomic data

In the absence of a reference genome to guide metagenomic data analyses for population genomics, we model species only by their variable loci. They are characterized by their associated depths of coverage and variant frequencies across environmental samples. We called this model metavariant species or MVS, and we proposed a method for constructing MVSs from multisample raw metagenomic data (Fig 1). The method is based on reference-free variant calling using metagenomic reads from different samples by DiscoSNP++ (Fig 2A).

In the context of metagenomics, the variants are termed metavariants and clustered into MVSs. The metavariants are clustered by multiple density-based clustering (mDBSCAN) based on the covariation of the depth of coverage of the corresponding variable loci across samples (Fig 2B). Clusters are then scored statistically based on the expected depth of coverage of the variable loci they contain in each sample (Fig 2C). The best clusters are selected by a maximum-weighted independent set (WMIN) algorithm (Fig 2D) and variable loci with a minimal coverage threshold are selected to obtain the final MVSs (Fig 2E). The method was implemented in a R package called metaVaR that allows users to build and manipulate MVSs for population genomic analyses (Fig 2F). The relevance of the MVSs and metaVaR was tested for population genomics use.

Metavariant species on simulated metagenomic data

To test the relevance of MVSs for population genomic analyses, we simulated seven metagenomic data sets composed of Illumina short reads from an admixture of six bacteria in various abundances (Fig 3A). Each bacterial species was composed of two strains in various abundances (Fig 3B), with a continuous increase of the derived strains abundance from metaG1 to metaG7. Metavariants were detected by DiscoSNP++ and filtered, giving 90,593 metavariants from which the depths of coverage of the variable loci and metavariant frequency matrices were computed.

The metavariants were clustered into MVS candidates using the mDBSCAN-WMIN algorithm, and the clustering performances were compared with the performances of three state-of-the-art algorithms [31]: (i) centered log-ratio and kmeans (CLR-kmeans); (ii) arcsin transform + Gaussian mixture model (arcsin-GMM); (iii) logit transform + Gaussian mixture model (logit-GMM).

The clustering performances of the four algorithms are summarized in Table 1 and illustrated in Fig 4. Overall, the mDBSCAN-WMIN algorithm had the highest precision, signal to noise ratio (STN), purity and entropy. mDBSCAN-WMIN was slightly but not significantly less sensitive than CLR-kmeans (Fig 4A), (paired U-test, P ≥ 0.05). mDBSCAN-WMIN was significantly more precise than the three other algorithms (Fig 4B) and had a significantly higher STN (Fig 4C) (paired U-test, P ≤ 0.05). Moreover, three of the mDBSCAN-WMIN clusters out of the six contained zero false positives. The arcsin-GMM and logit-GMM methods showed significantly lower recall (paired U-test, P ≤ 0.05).

The six clusters selected by mDBSCAN-WMIN correspond to three different DBSCAN parameter settings. Four clusters generated with e = 6, p = 8 were selected, one cluster for e = 6, p = 5 and one for e = 7, p = 12. This is a good illustration that with our scoring method there is not one single couple (e, p) that gives the highest-scoring metavariant clusters.

The samples size effect on the metaVaR performances was tested by sampling several times different number of metagenomes (Fig 5). We observed that the precision and purity increase with the number of metagenomes used as input while the recall and entropy decrease. The MVSs corresponding to the six initial species can be found using a minimum of five metagenomic samples.

The accuracy of the pairwise-F_ST estimation based on MVS was evaluated for the four clustering algorithms and compared with the accuracy obtained using metavariant alignment on the six bacterial genomes (Fig 4D). Pairwise-F_ST estimated on clusters from the four algorithms showed negligible differences with the reference-based approach for $Δ_{F_{S T}} \leq 0.01$ . However, the CLR-kmeans algorithm showed the lowest $Δ_{F_{S T}}$ values.

Metavariant species on real metagenomic data

To evaluate the relevance of MVS on real metagenomic data generated from environmental samples containing more complex genomes than the bacterial samples, we selected five metagenomic marine samples known to contain the zooplankter Oithona nana in sufficient abundance for population genomic analyses [23].

We ran DiscoSNP++ on the raw data and generated 1,159,157 metavariants, filtered into 138,676 metavariants. The metavariants were clustered into MVSs using the same four clustering algorithms as previously tested for simulated data. MVSs corresponding to O. nana were identified by aligning the metavariants on its genome. mDBSCAN-WMIN and CLR-kmeans detected the O. nana MVS, but the two other methods generated the maximum number of MVSs allowed by the parameters (i.e 12 clusters), with no clusters assigned to O. nana. These two other methods are therefore not considered further (Table 2). The O. nana MVS built by mDBSCAN-WMIN is less complete but more accurate than that built by CLR-kmeans (Table 2). The metavariants that were not relocated on the O. nana may include missing parts of the genome assembly.

Table 2. Metavariant clustering performances on real metagenomic data.

Algorithm	Software	Number of cluster	Recall^*	Precision^*	Signal-To-Noise^*	Time CPU
mDBSCAN-WMIN	metaVaR	3	0.1592	0.8144	1691.18	1.48
CLR-kmeans	coseq	4	0.1662	0.7182.	99.79	4.5
arcsin-GMM	coseq	12	-	-	-	5.53
logit-GMM	coseq	12	-	-	-	4.9

Open in a new tab

*Performances for the O. nana cluster. Time CPU is in minutes.

The pairwise-F_ST matrices of the O. nana MVSs (Fig 6A) showed small differences in relation to alignment-based F_ST values ( $Δ_{F_{S T}} \leq 0.03$ ) (Fig 6B). To illustrate potential MVS applications, we performed several downstream analyses, including isolation-by-distance (IBD) (Fig 6C), species co-differentiation (Fig 6D), and natural selection tests (Fig 6E and 6F). In the Mediterranean, the Lagrangian distances between the western and eastern basin sampling sites (S10, 11, 12 and S24, 26 respectively) can to a large extent explain the O. nana genomic differentiation (Mantel r = 0.73, p − value ≤ 0.05) (Fig 6C) [32]. Comparing the genomic differentiation trends between three MVSs (detected by the mDBSCAN-WMIN) revealed a negative correlation between MVS2 and MVS3, but no co-differentiation patterns between other MVS pairs.

Loci under selection in Mediterranean populations of O. nana were identified based on LK outlier values produced by the MVS-based and the alignment-based methods. For all three approaches, the LK distribution suggests neutral evolution at most of the O. nana polymorphic sites (Fig 6E). The prediction of loci under selection based on extreme LK values in the three approaches showed that all loci predicted under selection by the alignment-based approach are identified in the O. nana MVSs. Plus, more loci are identified under selection by both clustering methods, which differs only for one loci (Fig 6F).

Loci characterized by extreme LK values were positively detected with the three approaches. The 12 loci predicted under selection by the alignment-based approach are identified in the O. nana MVSs. More loci are identified under selection by both clustering methods, which differs only for one loci (Fig 6F).

Discussion

Modeling species’ nucleotide polymorphism by metavariant species

Large-scale nucleotide polymorphism detection traditionally requires reference sequences such as a genome or transcriptome assemblies. In most species, these resources are lacking, which greatly reduces the scope of population genomic investigations, since the choice of species is limited to those that have been sequenced. The small number of species that have a transcriptome or genome reference are unrepresentative of the whole genomic landscape of small eukaryote-rich biomes found in the oceans in particular [2]. MVS representation allows this obstacle to be overcome, making it possible to investigate a much larger number of species, including unknown species for which few or no genomic resources are available. However, if a reference sequence is available for a targeted species, we recommend also to use classical reference and alignment-based approaches.

MVS modelling nevertheless still requires a minimal amount of genomic information, which includes the variable loci of single copy regions and their variant frequencies in different samples. This information is sufficient for basic population genomic analyses, such as genomic differentiation and detection of loci under natural selection. The extraction of this information from raw metagenomic data through the use of DiscoSNP++ does not need a reference or assembly, and generates accurate variant frequencies in a reasonable time using computational resources, even for very large datasets [23].

Our results showed a large amount of false negative metavariants that where not assigned to any MVS using both mDBSCAN-WMIN or other coseq clustering algorithms. This can be explained by the fact that we used only the metagenomic abundance. The reads depth of coverage distribution is known to be over-dispersed when using Illumina sequencing [26], which disables to cluster all loci into MVC in the simulated data and a fortiori in real data. However, we proved that the accuracy of the population genomic metrics is not affected by this false negative rate (or low recall) and metaVaR estimates accurate F_ST with both simulated and real data.

To ease the MVS-based population genomic analysis, we selected only the biallelic loci which defacto reduces the complexity of the analysis but allows to compute directly the allele frequency instead of working on the four nucleotides frequencies.

As the allele frequency estimation bias depends greatly on the depth of coverage. High depth of coverage (>20x) allows more accurate allele frequency estimation, but such coverage are not always reached in metagenomic samples. However, here the selection of loci with at least 8x is sufficient to provide useful information for accurate allele frequency estimation.

In order to gain an ecological insight from MVSs, we recommend performing a taxonomic assignment of the MVSs, which can be done by aligning the variable loci sequences generated by DiscoSNP++ against public sequence databases. Due to their short length, metavariants can be aligned using classical short read aligners. Such taxonomic assignment of metavariants has already been successfully performed on real large metagenomic data set [33]. Moreover, for users with specific interest in a particular species, genome or transcriptome references can be used directly during the variant calling step using only DiscoSNP++ with no need of metaVaR.

The challenge of metavariant classification by species

The accuracy of the genomic differentiation estimation depends on the sample size and the number of markers. Having an exhaustive set of SNPs is not mandatory, but a large set (> 1,000 SNPs) is preferable [34]. The number of metavariants required for an MVS to be considered valid is critical. Moreover, the loci under selection often represent a small fraction of the genome. Increasing the number of metavariants in an MVS can help to detect these loci, but it is crucial to avoid false positives that generate biased F_ST values. In metagenomics, false positives are metavariants assigned to a species that they do not belong to. F_ST values derived from mis-assigned metavariants may generate outliers and support false signals of natural selection. For this reason precision may be deemed to be the criterion with the highest priority. Clustering the metavariants using mDBSCAN-WMIN and CLR-kmeans gave the best clustering results on simulated and real data, but mDBSCAN-WMIN is nonetheless more specific but less sensitive. Both methods have also their own particular limitations. With mDBSCAN-WMIN a variety of clustering parameters (e, p) generally needs to be tried. For example, values for e ranging from 0.1 to 1 and p from 10 to 100 can be initially tried. However, several runs will often be necessary to obtain all possible MVSs. CLR-kmeans involves trying different values of k with no prior knowledge, and the optimum value of k may be missed.

From our simulations, the number of metagenomic samples appeared to be critical to find all the possible MVSs and it seems reasonable to use at least six metagenomic samples to obtain MVSs with a high precision. However, the between samples homogeneity of the species relative abundance may also affect the performance of the method. Meaning that two species whose abundance covariates may be clustered in the same MVS. To cope with this problem, increasing the number of metagenomic samples remains a good option. However, as demonstrated by our simulations and due to metavariants filtering, the increasing number of samples will also lead to a decrease of the recall.

Toward a holistic view of microorganism genomic differentiation and natural selection

Current population genomic analyses focus on one single species at a time for which we do not have a sequence reference. Thanks to MVSs, the genomic differentiation of several species without genomic reference can be modelled simultaneously, and hypotheses like isolation-by-distance can be tested on each species. The genomic differentiation of MVSs can be compared and species sharing common differentiation profiles can be identified to illustrate possible co-differentiation or similar gene flow. However, even if the method is able to retrieve several MVSs from multiple samples, this will never cover the whole complexity within and among samples. Indeed, only few species present a sufficient sequencing depth of coverage and enough metavariants to pass the quality filters in real samples.

To give an idea of the potential of metaVaR on real large metagenomic data, we tested our method on four data sets containing several millions of metavariants from a total of 114 samples produced in a previous study [23] using Tara Oceans data and we obtained a total of 113 MVSs (Supplementary Result 1 in S1 File). The investigations of these MVSs will hopefully lead to new knowledge on plankton molecular evolution.

Another relevant MVS application concerns natural selection. The ratio of loci under selection over the total number of variable loci is an interesting metric for estimating the impact of natural selection on the molecular evolution of a species. This ratio can be computed for each MVS and the different ratios compared in order to assess the relative effect of natural selection.

Current limitations and future developments for metavariant species-based population genomics

Pairwise-F_ST currently remains a robust metric for tracing the silhouette of the genomic differentiation from metagenomic data. The absence of genotypes and haplotypes and their relative frequencies precludes intra-population analysis, and makes it impossible to compute p-values for F_ST. Moreover, the use of population genomic tools enabling the estimation of nucleotide diversity, the identification of genomic structure, and the testing of evolutionary trajectories and past demographic events is not yet possible. For these reasons, future developments focusing on variant phasing and haplotyping from metagenomic data will greatly help to improve MVS applications. In this context, the use of long read sequencing technologies will be of great benefit, by supporting long-range haplotypes spanning several kilobases.

Conclusion

MVSs make it feasible to carry out population genomic analyses of unknown organisms without a reference genome or genome assembly. MVSs are suitable for genomic differentiation and natural selection analysis. Simultaneous access to nucleotide polymorphisms of different species present in the same ecosystem allows for a holistic view of microorganism genomic differentiation and adaptation. Future developments will attempt to reconstruct species haplotypes based on metavariant species, in order to provide a more accurate view of species evolution.

Supporting information

S1 File

(PDF)

Click here for additional data file.^{(120KB, pdf)}

Acknowledgments

Our thanks to David Vallenet, Guillaume Gautreau and Adelme Bazin for helpful discussions.

Data Availability

Data related to this study has been uploaded to GitHub: https://github.com/madoui/metaVaR.

Funding Statement

The authors received no specific funding for this work.

References

1. Chariton A. Environmental DNA: For Biodiversity Research and Monitoring. Mol Ecol. 2019;28(20):4549–4551. 10.1111/mec.15235 [DOI] [Google Scholar]
2. Carradec Q, Pelletier E, Da Silva C, Alberti A, Seeleuthner Y, Blanc-Mathieu R, et al. A global ocean atlas of eukaryotic genes. Nat Commun. 2018;9(1):373 10.1038/s41467-017-02342-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
3. Ibarbalz FM, Henry N, Brandão MC, Martini S, Busseni G, Byrne H, et al. Global Trends in Marine Plankton Diversity across Kingdoms of Life. Cell. 2019;179(5):1084–1097.e21. 10.1016/j.cell.2019.10.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Vogel TM, Simonet P, Jansson JK, Hirsch PR, Tiedje JM, Van Elsas JD, et al. TerraGenome: a consortium for the sequencing of a soil metagenome. Nat Rev Microbiol. 2009;7(252). [Google Scholar]
5. Ehrlich SD, Consortium M, et al. MetaHIT: The European Union Project on metagenomics of the human intestinal tract In: Metagenomics of the human body. Springer; 2011. p. 307–316. [Google Scholar]
6. Delmont T, Kiefl E, Kilinc O, Esen O, Uysal I, Rappé M, et al. Single-amino acid variants reveal evolutionary processes that shape the biogeography of a global SAR11 subclade. Elife. 2019;8:e46497 10.7554/eLife.46497 [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493(7430):45–50. 10.1038/nature11711 [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Garud NR, Pollard KS. Population Genetics in the Human Microbiome. Trends in Genetics. 2020;36(1):53–67. 10.1016/j.tig.2019.10.010. [DOI] [PubMed] [Google Scholar]
9. Ellegaard K, Engel P. Genomic diversity landscape of the honey bee gut microbiota. Nat Commun. 2019;10(446). 10.1038/s41467-019-08303-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Salazar G, Paoli L, Alberti A, Huerta-Cepas J, Ruscheweyh HJ, Cuenca M, et al. Gene Expression Changes and Community Turnover Differentially Shape the Global Ocean Metatranscriptome. Cell. 2019;179(5):1068–1083.e21. 10.1016/j.cell.2019.10.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Madoui M, Poulain J, Sugier K, Wessner M, Noel B, Berline L, et al. New insights into global biogeography, population structure and natural selection from the genome of the epipelagic copepod Oithona. Mol Ecol. 2017;26(17):4467–4482. 10.1111/mec.14214 [DOI] [PubMed] [Google Scholar]
12. Costea P, Munch R, Coelho L, Paoli L, Sunagawa S, Bork P. metaSNV: A tool for metagenomic strain level analysis. PloS One. 2017;12(7):e0182392 10.1371/journal.pone.0182392 [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Sedlar K, Kupkova K, Provaznik I. Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics. Comput Struct Biotechnol J. 2016;15:48–55. 10.1016/j.csbj.2016.11.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Parks D, Rinke C, Chuvochina M, Chaumeil P, Woodcroft B, Evans P, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol. 2017;2:1533–1542. 10.1038/s41564-017-0012-7 [DOI] [PubMed] [Google Scholar]
15. Stewart R, Auffret M, Warr A, Walker A, Roehe R, Watson M. Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nat Biotechnol. 2019;37:953–961. 10.1038/s41587-019-0202-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
16. Huson D, Albrecht B, Bağcı C, Bessarab I, Górska A, Jolic D, et al. MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol Direct. 2018;13(1). 10.1186/s13062-018-0208-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Somerville V, Lutz S, Schmid M, Frei D, Moser A, Irmler S, et al. Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol. 2019;19(1):143 10.1186/s12866-019-1500-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
18. Prezza N, Pisanti N, Sciortino M, Rosone G. SNPs detection by eBWT positional clustering. Algorithms Mol Biol. 2019;14 10.1186/s13015-019-0137-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Li Y, Lin Y. Kmer2SNP: reference-free SNP calling from raw reads based on matching. bioRxiv. 2020. 10.1101/2020.05.17.100305 [DOI] [Google Scholar]
20. Uricaru R, Rizk G, Lacroix V, Quillery E, Plantard O, Chikhi R, et al. Reference-free detection of isolated SNPs. Nucleic Acids Research. 2015;43(2):e11 10.1093/nar/gku1187 [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Standage D, Brown C, Hormozdiari F. Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants. iScience. 2019;18:28–36. 10.1016/j.isci.2019.07.032 [DOI] [PMC free article] [PubMed] [Google Scholar]
22. Fang H, Bergmann E, Arora K, Vacic V, Zody M, Iossifov I, et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat Protoc. 2016;11:2529–2548. 10.1038/nprot.2016.150 [DOI] [PMC free article] [PubMed] [Google Scholar]
23. Arif M, Gauthier J, Sugier K, Iudicone D, Jaillon O, Wincker P, et al. Discovering Millions of Plankton Genomic Markers from the Atlantic Ocean and the Mediterranean Sea. Mol Eco Res. 2019;19(2):526–535. 10.1111/1755-0998.12985 [DOI] [PubMed] [Google Scholar]
24. Sangwan N, Xia F, Gilbert J. Recovering complete and draft population genomes from metagenome datasets. Microbiome. 2016;4(8). 10.1186/s40168-016-0154-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Campello RJGB, Moulavi D, Sander J. Density-Based Clustering Based on Hierarchical Density Estimates. In: Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases. vol. 7. Berlin: Springer; 2013. p. 160.
26. Robinson D, Smyth G. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–2887. 10.1093/bioinformatics/btm453 [DOI] [PubMed] [Google Scholar]
27. Delignette-Muller ML, Dutang C. fitdistrplus: An R package for fitting distributions. J Stat Softw. 2015;64(4):1–34. 10.18637/jss.v064.i04 [DOI] [Google Scholar]
28. Sakai S, Togasaki M, Yamazaki K. A note on greedy algorithms for the maximum weighted independent set problem. Discrete Applied Mathematics. 2003;126(2):313–322. 10.1016/S0166-218X(02)00205-6. [DOI] [Google Scholar]
29. Lewontin R, Krakauer J. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics. 1973;74(1):175–195. [DOI] [PMC free article] [PubMed] [Google Scholar]
30. Richter D, Ott F, Auch A, Schmid R, Huson D. MetaSim—A Sequencing Simulator for Genomics and Metagenomics. Plos one. 2008;3(10):e3373 10.1371/journal.pone.0003373 [DOI] [PMC free article] [PubMed] [Google Scholar]
31. Rau A, Maugis-Rabusseau C. Transformation and model choice for RNA-seq co-expression analysis. Brief Bioinform. 2018;19(3):425–436. 10.1093/bib/bbw128 [DOI] [PubMed] [Google Scholar]
32. Mantel N. The detection of disease clustering and a generalized regression approach. Cancer Research. 1967;27(2):209–220. [PubMed] [Google Scholar]
33. Laso-Jadart R, Sugier K, Petit E, Labadie K, Peterlongo P, Ambroise C, et al. Investigating population-scale allelic differential expression in wild populations of Oithona similis (Cyclopoida, Claus, 1866). Ecology and Evolution. 2020;10(16):8894–8905. 10.1002/ece3.6588. [DOI] [PMC free article] [PubMed] [Google Scholar]
34. Willing EM, Dreyer C, van Oosterhout C. Estimates of genetic differentiation measured by F(ST) do not necessarily require large sample sizes when using many SNP markers. PloS ONE. 2012;7(8):e42649 10.1371/journal.pone.0042649 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS One. doi: 10.1371/journal.pone.0244637.r001

Decision Letter 0

Francisco Rodriguez-Valera

13 Nov 2020

PONE-D-20-25895

metaVaR: introducing metavariant species models for reference-free metagenomic-based population genomics

PLOS ONE

Dear Dr. Madoui,

First of all apologies for the long delay in assessing your manuscript. It is hard to find reviewers for bioinformatics manuscripts. Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 19 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Francisco Rodriguez-Valera

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please amend your list of authors on the manuscript to ensure that each author is linked to an affiliation. Authors’ affiliations should reflect the institution where the work was done (if authors moved subsequently, you can also list the new affiliation stating “current affiliation:….” as necessary).

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In the article "metaVAR: introducing metavariant species models for reference-free metagenomic-based population genomics", the authors present a method to retrieve polymorphism data from metagenome samples without the need for a reference genome, based on performing variant calling directly on the raw reads, then grouping the different variants into "Metavariant Species" (MVS) based on the variant coverage.

The method is mathematically sound and it certainly has a place in current metagenomic analyses. Although the production of Metagenome Assembled Genomes (MAGs) is widespread on the field, there are still a lot of samples for which the assembly of contigs to build MAGs is still difficult (One such example is any sample where one species of microorganism is specially abundant, such as salterns). However, I see a few issues that need to be addressed before this article can be published.

One of them is the overall readability of the article, specially the introduction. Although there are not any grammatical or spelling errors, there are some phrases that should be rewritten to make it easier to understand. I would bring special attention to lines 11-16, 50-53 and 256-275, but the entire article could do with an additional punctuation pass. The figures could do with a bit of retouching: Figure 1 in particular has too much information, including pictures referring to different sections of the article. Offloading some of the subfigures to Supplementary data would make the figure easier to parse. I am also missing a figure explaining in a graphical manner the concept of variable loci and metavariants: understanding these concepts is vital to understand the method, but the definition in the text is correct but a tad too formal for someone without proper mathematics formation, and a good figure could help biologists understand the core concepts of the article. The Algorithm 1 could also be removed from the article, as it is not required to understand the method and could be moved to Supplementary Data.

My main gripe with the article, however, is the application of this method to real metagenomic analyses. I'm particularly worried about the following issues:

*The scoring of each metavariant cluster depends on the number of samples, but metagenomic studies do not usually include replicates and the amount of samples sequenced is usually very low. How does sample size affect metavariant clustering? How many samples are needed to obtain a good separation of clusters? Does the homogeneity of the samples (How similar are they to one another, in terms of population composition) affect the scoring somehow?

* I am particularly troubled about the number of metavariants the method is able to recover: the simulated metagenomic dataset test only uses 6 genomes, and the real metagenomic test focuses only on a single genome. A real metagenomic sample is going to have a lot more genomes. How many is the tool able to recover? If the tool is only able to recover a limited number of genomes per run, is it possible to direct the tool to recover an specific genome?

* Although I understand the motivation of building MVS, a biologist using this tool still needs a way to connect a MVS to a genome, which the article does not include.

If these issues are adressed, then I have no doubt metaVaR has a bright future as a tool in population genomics.

Reviewer #2: In this manuscript the authors introduce a new method to detecting micro-diversity in metagenomic datasets. They also introduce the concept of metavariants and metavariant species. Their tool, MetaVaR was tested on both real and simulated datasets and presented superior performance. The manuscript is relevant and well written, the experiments are well designed and the results support the conclusions. Nevertheless some minor changes and clarifications are necessary to make this manuscript suitable for publication.

Ln 71: It seems to me that considering only a single metavariant might be a major limitation of the proposed method, albeit necessary to make the computations feasible. This should be mentioned in the discussion.

Ln 73: Would it not be more logical to establish the reference as the variant that has higher coverage?

Ln 113: Which distance metric was used with the DBSCAN algorithm? Did you test different metrics?

Ln 189-191: This setup is an overly simplistic representation of metagenomes. In most real world datasets metagenomes are made up of many more species and often multiple strains of the same species. Using only six species and no multiple strains from the same species is likely to inflate the precision and skew the other evaluation metrics.

Ln 195: Perhaps the term “communities” is more adequate than “populations” considering you are dealing with different species.

Figure 1C: Is there evidence to support that the DBSCAN noise points are all the result of inter-species variation or is this one of your assumptions? In case of the latter, this should be explicitly stated along with the rationale behind it.

Figure 2A: Is it correct to assume that coverage values refer to specific positions within genomes rather than coverage of the whole genome? If so, please clarify

Ln 253: word missing after evolutionary?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Dec 30;15(12):e0244637. doi: 10.1371/journal.pone.0244637.r002

Author response to Decision Letter 0

3 Dec 2020

Dear editor and reviewers,

I really appreciate the time spent on reviewing our study. We took all your remark in consideration and redacted a complete response point by point to satify your legitimate concerns. We hope the effecort we provided fill fill your expectations. The manuscript contains major changes and new analysis and we hope the readers will find a great interest on our work. A detailed answer is provided in a seperate file.

Regards

Best regards,

The authors

Attachment

Submitted filename: Rebuttal_letter.pdf

Click here for additional data file.^{(101.8KB, pdf)}

PLoS One. doi: 10.1371/journal.pone.0244637.r003

Decision Letter 1

Francisco Rodriguez-Valera

15 Dec 2020

metaVaR: introducing metavariant species models for reference-free metagenomic-based population genomics

PONE-D-20-25895R1

Dear Dr. Madoui,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Francisco Rodriguez-Valera

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Sorry for the delays

Reviewers' comments:

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 File

(PDF)

Click here for additional data file.^{(120KB, pdf)}

Attachment

Submitted filename: Rebuttal_letter.pdf

Click here for additional data file.^{(101.8KB, pdf)}

Data Availability Statement

Data related to this study has been uploaded to GitHub: https://github.com/madoui/metaVaR.

[pone.0244637.ref001] 1. Chariton A. Environmental DNA: For Biodiversity Research and Monitoring. Mol Ecol. 2019;28(20):4549–4551. 10.1111/mec.15235 [DOI] [Google Scholar]

[pone.0244637.ref002] 2. Carradec Q, Pelletier E, Da Silva C, Alberti A, Seeleuthner Y, Blanc-Mathieu R, et al. A global ocean atlas of eukaryotic genes. Nat Commun. 2018;9(1):373 10.1038/s41467-017-02342-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref003] 3. Ibarbalz FM, Henry N, Brandão MC, Martini S, Busseni G, Byrne H, et al. Global Trends in Marine Plankton Diversity across Kingdoms of Life. Cell. 2019;179(5):1084–1097.e21. 10.1016/j.cell.2019.10.008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref004] 4. Vogel TM, Simonet P, Jansson JK, Hirsch PR, Tiedje JM, Van Elsas JD, et al. TerraGenome: a consortium for the sequencing of a soil metagenome. Nat Rev Microbiol. 2009;7(252). [Google Scholar]

[pone.0244637.ref005] 5. Ehrlich SD, Consortium M, et al. MetaHIT: The European Union Project on metagenomics of the human intestinal tract In: Metagenomics of the human body. Springer; 2011. p. 307–316. [Google Scholar]

[pone.0244637.ref006] 6. Delmont T, Kiefl E, Kilinc O, Esen O, Uysal I, Rappé M, et al. Single-amino acid variants reveal evolutionary processes that shape the biogeography of a global SAR11 subclade. Elife. 2019;8:e46497 10.7554/eLife.46497 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref007] 7. Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493(7430):45–50. 10.1038/nature11711 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref008] 8. Garud NR, Pollard KS. Population Genetics in the Human Microbiome. Trends in Genetics. 2020;36(1):53–67. 10.1016/j.tig.2019.10.010. [DOI] [PubMed] [Google Scholar]

[pone.0244637.ref009] 9. Ellegaard K, Engel P. Genomic diversity landscape of the honey bee gut microbiota. Nat Commun. 2019;10(446). 10.1038/s41467-019-08303-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref010] 10. Salazar G, Paoli L, Alberti A, Huerta-Cepas J, Ruscheweyh HJ, Cuenca M, et al. Gene Expression Changes and Community Turnover Differentially Shape the Global Ocean Metatranscriptome. Cell. 2019;179(5):1068–1083.e21. 10.1016/j.cell.2019.10.014. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref011] 11. Madoui M, Poulain J, Sugier K, Wessner M, Noel B, Berline L, et al. New insights into global biogeography, population structure and natural selection from the genome of the epipelagic copepod Oithona. Mol Ecol. 2017;26(17):4467–4482. 10.1111/mec.14214 [DOI] [PubMed] [Google Scholar]

[pone.0244637.ref012] 12. Costea P, Munch R, Coelho L, Paoli L, Sunagawa S, Bork P. metaSNV: A tool for metagenomic strain level analysis. PloS One. 2017;12(7):e0182392 10.1371/journal.pone.0182392 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref013] 13. Sedlar K, Kupkova K, Provaznik I. Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics. Comput Struct Biotechnol J. 2016;15:48–55. 10.1016/j.csbj.2016.11.005 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref014] 14. Parks D, Rinke C, Chuvochina M, Chaumeil P, Woodcroft B, Evans P, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol. 2017;2:1533–1542. 10.1038/s41564-017-0012-7 [DOI] [PubMed] [Google Scholar]

[pone.0244637.ref015] 15. Stewart R, Auffret M, Warr A, Walker A, Roehe R, Watson M. Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nat Biotechnol. 2019;37:953–961. 10.1038/s41587-019-0202-3 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref016] 16. Huson D, Albrecht B, Bağcı C, Bessarab I, Górska A, Jolic D, et al. MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol Direct. 2018;13(1). 10.1186/s13062-018-0208-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref017] 17. Somerville V, Lutz S, Schmid M, Frei D, Moser A, Irmler S, et al. Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol. 2019;19(1):143 10.1186/s12866-019-1500-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref018] 18. Prezza N, Pisanti N, Sciortino M, Rosone G. SNPs detection by eBWT positional clustering. Algorithms Mol Biol. 2019;14 10.1186/s13015-019-0137-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref019] 19. Li Y, Lin Y. Kmer2SNP: reference-free SNP calling from raw reads based on matching. bioRxiv. 2020. 10.1101/2020.05.17.100305 [DOI] [Google Scholar]

[pone.0244637.ref020] 20. Uricaru R, Rizk G, Lacroix V, Quillery E, Plantard O, Chikhi R, et al. Reference-free detection of isolated SNPs. Nucleic Acids Research. 2015;43(2):e11 10.1093/nar/gku1187 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref021] 21. Standage D, Brown C, Hormozdiari F. Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants. iScience. 2019;18:28–36. 10.1016/j.isci.2019.07.032 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref022] 22. Fang H, Bergmann E, Arora K, Vacic V, Zody M, Iossifov I, et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat Protoc. 2016;11:2529–2548. 10.1038/nprot.2016.150 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref023] 23. Arif M, Gauthier J, Sugier K, Iudicone D, Jaillon O, Wincker P, et al. Discovering Millions of Plankton Genomic Markers from the Atlantic Ocean and the Mediterranean Sea. Mol Eco Res. 2019;19(2):526–535. 10.1111/1755-0998.12985 [DOI] [PubMed] [Google Scholar]

[pone.0244637.ref024] 24. Sangwan N, Xia F, Gilbert J. Recovering complete and draft population genomes from metagenome datasets. Microbiome. 2016;4(8). 10.1186/s40168-016-0154-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref025] 25.Campello RJGB, Moulavi D, Sander J. Density-Based Clustering Based on Hierarchical Density Estimates. In: Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases. vol. 7. Berlin: Springer; 2013. p. 160.

[pone.0244637.ref026] 26. Robinson D, Smyth G. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–2887. 10.1093/bioinformatics/btm453 [DOI] [PubMed] [Google Scholar]

[pone.0244637.ref027] 27. Delignette-Muller ML, Dutang C. fitdistrplus: An R package for fitting distributions. J Stat Softw. 2015;64(4):1–34. 10.18637/jss.v064.i04 [DOI] [Google Scholar]

[pone.0244637.ref028] 28. Sakai S, Togasaki M, Yamazaki K. A note on greedy algorithms for the maximum weighted independent set problem. Discrete Applied Mathematics. 2003;126(2):313–322. 10.1016/S0166-218X(02)00205-6. [DOI] [Google Scholar]

[pone.0244637.ref029] 29. Lewontin R, Krakauer J. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics. 1973;74(1):175–195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref030] 30. Richter D, Ott F, Auch A, Schmid R, Huson D. MetaSim—A Sequencing Simulator for Genomics and Metagenomics. Plos one. 2008;3(10):e3373 10.1371/journal.pone.0003373 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref031] 31. Rau A, Maugis-Rabusseau C. Transformation and model choice for RNA-seq co-expression analysis. Brief Bioinform. 2018;19(3):425–436. 10.1093/bib/bbw128 [DOI] [PubMed] [Google Scholar]

[pone.0244637.ref032] 32. Mantel N. The detection of disease clustering and a generalized regression approach. Cancer Research. 1967;27(2):209–220. [PubMed] [Google Scholar]

[pone.0244637.ref033] 33. Laso-Jadart R, Sugier K, Petit E, Labadie K, Peterlongo P, Ambroise C, et al. Investigating population-scale allelic differential expression in wild populations of Oithona similis (Cyclopoida, Claus, 1866). Ecology and Evolution. 2020;10(16):8894–8905. 10.1002/ece3.6588. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0244637.ref034] 34. Willing EM, Dreyer C, van Oosterhout C. Estimates of genetic differentiation measured by F(ST) do not necessarily require large sample sizes when using many SNP markers. PloS ONE. 2012;7(8):e42649 10.1371/journal.pone.0042649 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

metaVaR: Introducing metavariant species models for reference-free metagenomic-based population genomics

Romuald Laso-Jadart

Christophe Ambroise

Pierre Peterlongo

Mohammed-Amin Madoui

Roles

Abstract

Introduction

Methods

From metavariants to metavariant species

Variable loci and metavariants

Fig 1. Metavariants from environmental samples and metagenomic high-throughput sequencing.

Clustering of metavariants into metavariant species

Metavariant species construction steps and algorithms

Fig 2. Metavariant species construction from metagenomic data with metaVaR.

Multiple density-based clustering of metavariants

Scoring of metavariant clusters

Metavariant species as a maximum-weighted independent set

Selection of metavariant clusters as metavariant species

MVS-based population genomic analysis

Implementation of the mDBSCAN-WMIN algorithm in metaVaR

Metavariants from simulated metagenomic data

Fig 3. Simulated seven-metagenomic dataset on an admixture of six bacterial species containing within-species single nucleotide polymorphism.

Metavariants from real metagenomic data

Table 1. Metavariant clustering performances on simulated metagenomic data.

Comparison of mDBSCAN-WMIN to other sequence abundance-based clustering algorithms

Results

Metavariant species as a new modelling of organisms from metagenomic data

Metavariant species on simulated metagenomic data

Fig 4. Metavariant clustering performances of four clustering algorithms.

Fig 5. Sample size effect on metavariant clustering performances.

Metavariant species on real metagenomic data

Table 2. Metavariant clustering performances on real metagenomic data.

Fig 6. Examples of applications and accuracy of metavariant species.

Discussion

Modeling species’ nucleotide polymorphism by metavariant species

The challenge of metavariant classification by species

Toward a holistic view of microorganism genomic differentiation and natural selection

Current limitations and future developments for metavariant species-based population genomics

Conclusion

Supporting information

Acknowledgments

Data Availability

Funding Statement

References

Decision Letter 0

Francisco Rodriguez-Valera

Roles

Author response to Decision Letter 0

Decision Letter 1

Francisco Rodriguez-Valera

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases