Unitig level assembly graph based metagenome-assembled genome refiner (UGMAGrefiner): A tool to increase completeness and resolution of metagenome-assembled genomes

Baoyu Xiang; Liping Zhao; Menghui Zhang

doi:10.1016/j.csbj.2023.03.030

2023 Mar 21;21:2394–2404. doi: 10.1016/j.csbj.2023.03.030

PMCID: PMC10091015 PMID: 37066122

Abstract

De novo assembly of next-generation metagenomic reads is widely used to provide taxonomic and functional information about the genomes in a microbial community. Because strains are functionally specific, recovering strain-resolved genomes is important but remains a challenge. Unitigs and assembly graphs are intermediate products generated during the assembly of reads into contigs, and they provide higher-resolution information on how sequences are connected. In this study, we propose a new approach, UGMAGrefiner (a unitig level assembly graph-based metagenome-assembled genome refiner), which uses the connection and coverage information from unitig level assembly graphs to recruit unbinned unitigs to MAGs, adjust binning results, and infer unitigs shared by multiple MAGs. In two simulated datasets (Simdata and CAMI data) and one real dataset (GD02), it outperforms two state-of-the-art assembly graph-based binning refinement tools in improving MAG quality by stably increasing genome completeness. UGMAGrefiner can identify genome specific clusters for genomes with below 99% average nucleotide identity for homologous sequences. For MAGs mixed with 99% similarity genome clusters, it could distinguish 8 out of 9 genomes in Simdata and 8 out of 12 genomes in CAMI data. In GD02 data, it could identify 16 new unitig clusters representing genome specific regions of mixed genomes and 4 unitig clusters representing new genomes from a total of 135 MAGs for further functional analysis. UGMAGrefiner provides an efficient way to obtain more complete MAGs and study genome specific functions. It will be useful for improving the taxonomic and functional information of genomes after de novo assembly.

Keywords: Metagenome, Metagenomic assembly, Assembly graph, Binning refinement, Genome specific unitig cluster

Highlights

• UGMAGrefiner can improve the completeness of metagenome-assembled genomes.
• UGMAGrefiner can identify genome specific sequences from similar genomes, which are always missed in de novo assembly.
• UGMAGrefiner has comparable computational requirements and is easy to use.

1. Introduction

Metagenomic sequencing studies all the genetic material from the genomes present in an environment. With the development of sequencing technologies, shotgun metagenomic sequencing provides an efficient way to study the composition and function of microbes in a defined environment and can be used to identify new microbes that might not be cultivated [1].

To investigate the organisms present in a complex microbial community and their functions, we can map reads to predefined references using Kraken [2], [3], [4], bioBakery3 [5], etc. These mapping-based methods can provide a good view of the composition, especially for low-abundance organisms, and can estimate the functions of the community. However, mapping-based methods cannot detect novel strains when the reference databases do not contain their information. De novo assembly can discover novel strains along with their functions through metagenome-assembled genomes (MAGs). In general, de novo assembly first assembles sequence reads into longer contigs, which contain more genomic information, and then bins the contigs into different groups based on each contig’s composition and/or abundance, resulting in MAGs [6]. The mapping-based methods and the de novo assembly methods are complementary and are widely used to explore the microbial communities in various environments [7], [8], [9], [10].

Since different species or even different strains have different functions [11], obtaining high-quality MAGs is crucial for distinguishing between these strains. Many efforts have been made to improve the completeness, decrease the contamination, and improve the resolution of MAGs. Binning is an important step in de novo assembly that clusters fragmented contigs into different groups, representing different microbial genomes. Many binning tools of different categories exist, such as CONCOCT [12], MaxBin2 [13], MetaBAT2 [14], MetaDecoder [15], VAMB [16], MetaCRS [17], etc. These tools use the composition characteristics (such as GC content and tetranucleotide frequency) and coverage information of each contig to perform unsupervised clustering with different methods. Other binning tools, such as SolidBin [18], vRhyme [19], SemiBin [20], and StrainGE [21], use references or impose restrictions during binning to meet special requirements. Tools like binning_refiner [22], DAS tool [23], and MetaWRAP [24] can improve the quality of MAGs by combining binning results from different tools. However, MAGs still face several challenges: (1) fragmented contigs cause incompleteness of MAGs and make function annotation harder for some special regions; (2) most binning tools discard short contigs (<1 kb or 1.5 kb, depending on the tool), which may increase the incompleteness of MAGs; (3) contigs from similar genomes are mixed in one MAG in such a way that only the shared or most abundant ones are retained, while the genome specific ones are not kept.

An assembly graph is generated when reads are assembled into contigs using de Bruijn graph-based methods like MEGAHIT [25] or metaSPAdes [26]. It contains linkage information among sequences, which can be used to predict genes better [27] and to improve binning results compared with fragmented contigs alone. GraphBin [28] and GraphBin2 [29] use this connection information and/or the coverage information of contigs to adjust the assignments of binned contigs to MAGs, predict the MAGs of contigs discarded by existing binning tools (unbinned contigs) through label propagation, and infer contigs shared by multiple MAGs. METAMVGL [30] integrates assembly graphs and paired-end graphs, which represent the paired-end reads shared between two contigs, to refine the binning results from different tools in a uniform multi-view label propagation framework. STRONG [31] can extract subgraphs of single-copy genes from a co-assembly graph, which can be used to estimate the number of strains present in each MAG and the strains’ haplotypes or sequences on the single-copy genes. These tools show the potential of using assembly graphs to refine binning results and improve the completeness of MAGs. However, the quality of newly added contigs and the ability to identify genome specific sequences from similar genomes mixed in one MAG for further functional analysis still need to be improved.

Unitigs, formed from k-mers, represent the unbranched sequences in an assembly graph. Several connected unitigs form a contig, and one unitig can be present in multiple contigs. A unitig level assembly graph has a higher resolution than a contig level assembly graph and can represent the direction of linkage between sequences, which is often ignored in contig assembly graphs. Hence, we hypothesize that constructing the assembly graph with unitigs instead of contigs may provide more detailed connection information at higher resolution and thereby improve binning results.

In this study, we propose a new approach called UGMAGrefiner, a unitig level assembly graph based metagenome-assembled genome refiner. The purpose of this tool is to increase the completeness of MAGs and to distinguish the genome specific parts that are always ignored when similar genomes are mixed in one MAG. Using two simulated datasets and one real dataset, we compared the performance of UGMAGrefiner with two other state-of-the-art assembly graph based pipelines, GraphBin2 [29] and METAMVGL [30]. We found that UGMAGrefiner can conservatively increase the completeness of MAGs and identify genome specific unitigs of mixed similar genomes when the sequencing depth is sufficient. These results suggest that UGMAGrefiner is a useful tool for refining binning results for further functional analysis.

2. Materials and methods

2.1. Datasets

2.1.1. Simulated datasets

Simdata: a simulated dataset containing 7 samples, each with 14 genomes that vary in ANI (average nucleotide identity for homologous sequences) and in abundance. Briefly, 5 of the 14 genomes belong to the same species with above 99% ANI, another 2 genomes belong to the same species and share more than 95% ANI with these 5 genomes, and the remaining 7 genomes belong to different species with ANI lower than 90%. The abundance of each genome was designed to be high or low, with its change across samples being constant, increasing, or decreasing. Each sample has 10 million 2×150 bp paired-end reads simulated by InSilicoSeq [32], modeling a NovaSeq instrument. Please refer to Supplementary Material Table S1 for further details. Simdata can be accessed at the NCBI SRA database with accession number PRJNA899675.

CAMI data: a publicly available dataset selected from the 2nd CAMI Toy Human Microbiome Project Dataset (https://data.cami-challenge.org/participate) [33]. This dataset contains 10 samples from the human gastrointestinal tract. Each sample contains 16 million 2×150 bp paired-end reads.

In this paper we chose widely used tools and pipelines for de novo assembly. Briefly, fastp [34] (version v0.20.1) is first used to remove adapters and low-quality reads. Then, de novo assembly is performed individually for each sample with metaSPAdes [26] (version v3.15.3), and the assembly graphs produced at unitig level are kept for further use. Finally, binning is completed with MetaWRAP [24] (version 1.3.2) using binning results from MaxBin [35] (version 2.2.6), MetaBAT2 [14] (version 2.12.1) and CONCOCT [12] (version 1.0.0). Unless otherwise stated, the default parameters of each tool are used.

2.1.2. GD02 data

A real metagenomic sequencing dataset collected by our lab from a child’s fecal samples at 7 time points, each sequenced on the Illumina HiSeq 2000 platform with about 25 million 2×150 bp paired-end reads [36]. This dataset can be accessed at the NCBI SRA database with accession number SRP045211. For the GD02 data, Trimmomatic [37] (v.0.39) was used to trim the adapters and to control the quality of the sequencing reads. The remaining reads that could be aligned to the human genome (Homo sapiens, UCSC hg19) with Bowtie2 [38] (v.2.3.5.1) were also removed. On average, 25.2 × 10⁶ ± 4.0 × 10⁶ (mean ± SD) paired-end high-quality reads per sample were retained and used for the downstream analysis.

2.2. Implementation

The workflow of UGMAGrefiner contains three modules (Fig. 1): de novo assembly preprocessing, recruitment of unbinned unitigs to MAGs, and identification of genome specific clusters.

2.2.1. Module 1: de novo assembly of sequencing reads

This module contains steps for read quality control, assembly, and binning. The quality control and binning steps can use any available tools, but the assembly step is currently restricted to metaSPAdes in order to obtain unitig information. Other assemblers that can provide a high-resolution assembly graph at unitig level could also be incorporated if available.

2.2.2. Module 2: recruiting unbinned unitigs to MAGs

1) Construct all unitig level assembly graphs with direction.

We construct unitig level assembly graphs G (nodes, edges) from the intermediate output generated by metaSPAdes. Each node N_i in G represents a unitig, which has a head (N_i-head) and a tail (N_i-tail). An edge in G represents a link between two nodes; for instance, nodes N_i and N_j may be joined by up to four edges: (N_i-head, N_j-head), (N_i-head, N_j-tail), (N_i-tail, N_j-head) and (N_i-tail, N_j-tail).
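As an illustration of this head/tail representation, the following minimal sketch builds such a graph from link records. It assumes, purely for illustration, that the links are available as GFA1 `L` lines and that "head" and "tail" denote the start and end of a unitig's forward sequence; the function names `end_pair` and `build_graph` are hypothetical and are not taken from UGMAGrefiner.

```python
from collections import defaultdict


def end_pair(from_orient, to_orient):
    """Map GFA1 link orientations to the unitig ends that touch.

    Assumption: a link leaves the source at its tail when the source is
    read forward ('+') and at its head when it is reversed ('-'); it
    enters the target at its head when the target is forward, and at
    its tail otherwise.
    """
    from_end = "tail" if from_orient == "+" else "head"
    to_end = "head" if to_orient == "+" else "tail"
    return from_end, to_end


def build_graph(gfa_lines):
    """Return {(unitig, end): {(unitig, end), ...}} from GFA1 'L' records."""
    graph = defaultdict(set)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "L":
            continue  # skip segment and other record types
        src, src_orient, dst, dst_orient = fields[1:5]
        src_end, dst_end = end_pair(src_orient, dst_orient)
        # Store the link in both directions so it can be walked either way.
        graph[(src, src_end)].add((dst, dst_end))
        graph[(dst, dst_end)].add((src, src_end))
    return dict(graph)
```

Keying the adjacency map by (unitig, end) rather than by unitig alone keeps the four possible edge types between two nodes distinct.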

2) Label nodes that have already been binned to MAGs.

We mark each node that belongs to a MAG with the label of its source MAG. In this step, some nodes may receive more than one label because they appear in multiple MAGs. The coverage of each node is obtained from metaSPAdes, and the mean coverage of each MAG is calculated from the coverage and length of its nodes.

3) Remove the labels of nodes with abnormally high coverage.

Some nodes may have abnormally high coverage relative to their MAG because they exist in multiple MAGs but were incorrectly binned into only the most abundant one. We remove the labels of nodes whose coverage is more than 1.5 times that of the corresponding MAG, and then reconsider their source MAG(s). The value 1.5 is the default parameter; the lower it is set, the more nodes are flagged as having abnormally high abundance. This default is suitable for almost all circumstances.
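Steps 2 and 3 can be sketched as follows. The text states only that the MAG mean coverage is computed from the coverage and length of its nodes, so the length-weighted mean used here is an assumption, and the function names are hypothetical.

```python
def mag_mean_coverage(unitigs):
    """Length-weighted mean coverage of a list of (length, coverage) pairs.

    Assumption: weighting by length is one plausible reading of
    'based on coverage and length of its nodes'.
    """
    total_len = sum(length for length, _ in unitigs)
    return sum(length * cov for length, cov in unitigs) / total_len


def strip_abnormal_labels(labels, info, ratio=1.5):
    """Drop a MAG label when the unitig coverage exceeds ratio x MAG mean.

    labels: {unitig: set of MAG names}
    info:   {unitig: (length, coverage)}
    Returns a new {unitig: set of MAG names} mapping.
    """
    members = {}
    for unitig, mags in labels.items():
        for mag in mags:
            members.setdefault(mag, []).append(info[unitig])
    mag_cov = {mag: mag_mean_coverage(us) for mag, us in members.items()}

    cleaned = {}
    for unitig, mags in labels.items():
        cov = info[unitig][1]
        cleaned[unitig] = {m for m in mags if cov <= ratio * mag_cov[m]}
    return cleaned
```

A unitig left with an empty label set is treated as unlabeled again and is reconsidered in step 4.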

4) Give MAG labels to unlabeled nodes.

This part iteratively searches for potential MAG label(s) for each unlabeled node (the loop in Fig. 1 part 2). Since each unlabeled node is processed independently, multiple CPUs can be used to save time. The loop stops when no more labels can be added to nodes or when the number of iterations reaches the predefined limit (default: 10). First, we find all unlabeled nodes (see Algorithm 1 for details). Second, for each unlabeled node under consideration, we run Breadth-First-Search to find its neighbor nodes that already have label(s) and decide which label(s) to give (see Algorithm 2 for details). As a result, an unlabeled node has two lists of labeled neighbor nodes, one starting from its “head” and one from its “tail”, each recording the neighbors’ labels and coverages (the boxes named ‘left labeled nodes’ and ‘right labeled nodes’ in Fig. 1). Third, we separately calculate the total coverage of labeled unitigs belonging to each MAG in the two boxes of Fig. 1 and merge the results; for example, if a MAG appears in both boxes, the entry with the lower coverage is removed. Then we filter the unlabeled nodes, keeping those whose coverage is below 1.5 times that of the corresponding MAG(s). Finally, we give the corresponding label(s) to the nodes whose coverage fold change is within a defined range (from 0.7 to 1.5 times the corresponding MAGs’ average coverage; this threshold is inspired by the criterion used in GraphBin2 for determining inconsistent vertices).
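The search and the fold-change test described above might be sketched as below. This is a simplified illustration rather than the authors' Algorithm 1 or 2: the graph is assumed to be a mapping from (unitig, end) to connected (unitig, end) pairs, the search is assumed to stop at the first labeled unitigs it reaches, `max_depth` is a hypothetical safety parameter, and the fold change is checked against one MAG at a time.

```python
from collections import deque

OTHER_END = {"head": "tail", "tail": "head"}


def labeled_neighbors(graph, labels, start, end, max_depth=5):
    """Breadth-first search leaving `start` through `end`.

    Unlabeled unitigs are traversed (entered at one end, left through
    the other); labeled unitigs are recorded and not expanded further.
    """
    found = set()
    seen = {start}
    queue = deque([(start, end, 0)])
    while queue:
        node, out_end, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for nxt, in_end in graph.get((node, out_end), ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if labels.get(nxt):
                found.add(nxt)
            else:
                queue.append((nxt, OTHER_END[in_end], depth + 1))
    return found


def candidate_mags(left, right, labels, coverage):
    """Total labeled-neighbor coverage per MAG; keep the larger side."""
    def side_totals(nodes):
        totals = {}
        for n in nodes:
            for mag in labels[n]:
                totals[mag] = totals.get(mag, 0.0) + coverage[n]
        return totals

    lt, rt = side_totals(left), side_totals(right)
    return {m: max(lt.get(m, 0.0), rt.get(m, 0.0)) for m in set(lt) | set(rt)}


def within_fold_change(node_cov, mag_cov, low=0.7, high=1.5):
    """True when node coverage is 0.7 to 1.5 times the MAG mean coverage."""
    return low <= node_cov / mag_cov <= high
```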

After the loop ends, we calculate the total length of newly added nodes for each MAG and remove clusters of newly added nodes whose total length exceeds 1.5 Mbp, since they might come from a new genome rather than from a missing part of an existing MAG.

Algorithm 1. Give label for each unbinned node.

Algorithm 2. Breadth-First-Search (runBFS).

2.2.3. Module 3: identifying unique unitig clusters

During de novo assembly, unitigs from highly similar genomes are often mixed into one MAG, in which only the common parts of these genomes or the most abundant genome is retained, while the unique parts of the similar genomes are missed. To handle this problem, we reconstruct the unique sequences of similar genomes by simultaneously using the unitigs’ connection information obtained from assembly graphs and a comparison of their coverage with that of the MAG. First, for each MAG, we take the unitigs with a coverage below 0.7 times the MAG average coverage in the first loop of module 2 and remove the newly added unitigs generated in all loops of module 2. Second, we cluster the remaining low-coverage unitigs in each MAG using a ‘GaussianMixture’ model. The number of clusters is chosen by minimizing the Bayesian information criterion. Then, clusters with similar mean coverage, within the range (0.7, 1.5), are further merged (following the threshold for determining inconsistent vertices in GraphBin2). Finally, since some genomes may have regions similar to multiple MAGs, some unitigs connected with these regions may be present in multiple clusters from different MAGs. In that case we keep these unitigs in the cluster with the maximum length to obtain the main clusters.
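The last step of this procedure, keeping a shared unitig only in the longest cluster that contains it, could look like the sketch below. The cluster length is assumed to be measured before any unitig is removed, and ties are resolved in favor of the cluster encountered first; both choices, like the function name, are illustrative assumptions.

```python
def resolve_shared_unitigs(clusters, lengths):
    """Keep each unitig only in the longest cluster that contains it.

    clusters: {cluster_id: set of unitig names}
    lengths:  {unitig name: length in bp}
    Returns a new {cluster_id: set of unitig names} mapping.
    """
    # Total length of every cluster before resolving overlaps (assumption).
    total = {cid: sum(lengths[u] for u in us) for cid, us in clusters.items()}

    owner = {}
    for cid, unitigs in clusters.items():
        for u in unitigs:
            if u not in owner or total[cid] > total[owner[u]]:
                owner[u] = cid

    resolved = {cid: set() for cid in clusters}
    for u, cid in owner.items():
        resolved[cid].add(u)
    return resolved
```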

The next step is to classify the clusters into two types according to each cluster’s mean coverage, total length and extend rate. Here, the extend rate of a cluster is calculated with the Breadth-First-Search algorithm as the total length of unlabeled unitigs that can be recruited to the cluster, divided by the length of the cluster. The higher the rate, the greater the chance that the cluster is from a new genome, since it has many unitigs beyond its connection with the initial MAG. The classification criteria are: (1) the cluster is from a new genome if its length is > 1 Mbp, or its length is between 100 Kbp and 1 Mbp and its extend rate is > 1; (2) the cluster represents genome specific sequences from mixed similar genomes if its length is between 100 Kbp and 1 Mbp, its extend rate is < 1, and its coverage is > 10. The remaining clusters are discarded. Of note, unitig clusters representing genome specific sequences from mixed similar genomes are required to have at least 10x sequencing depth to ensure reliability, because de novo assembly can currently hardly reconstruct genomes below 10x sequencing depth [33].
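These criteria translate directly into a small decision function. The sketch below is illustrative only: the function name is hypothetical, the interval bounds are treated as inclusive, and the case the text leaves open (an extend rate of exactly 1) falls through to "Discarded".

```python
def classify_cluster(length_bp, extend_rate, coverage):
    """Classify a unitig cluster as 'New', 'Separated' or 'Discarded'."""
    if length_bp > 1_000_000:
        return "New"  # longer than 1 Mbp: treated as a new genome
    if 100_000 <= length_bp <= 1_000_000:
        if extend_rate > 1:
            return "New"  # extends far beyond the initial MAG
        if extend_rate < 1 and coverage > 10:
            return "Separated"  # genome specific part of a mixed MAG
    return "Discarded"
```

For example, a 136,866 bp cluster with an extend rate of 1.27 is reported as "New", which agrees with the corresponding CAMI S1 entry in Table 4.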

In summary, module 3 of UGMAGrefiner can identify genome specific unitig clusters from genomes that are quite similar and mixed in one MAG. As the sequences of these clusters are always missed in general metagenome assembly, recruiting these clusters increases the ability to predict genome specific functions and improves the resolution of further metagenomic analysis.

2.3. Criteria for method performance evaluation

1) Evaluation of MAG refinement.

For Simdata and CAMI data, we used DNAdiff (version 1.3) from MUMmer [39] (version 4) to find each unitig’s source genome by mapping unitigs to the reference genomes. We assigned a unitig to a genome when the mapping coverage of this unitig to the genome was greater than 99%, and we allowed one unitig to belong to several genomes. For every MAG, we assigned the most abundant genome as its true source genome. We calculated precision, recall and F1 score for newly added unitigs to evaluate the refinement using the following equations:

$$\text{Precision} = \frac{\text{length of unitigs correctly added}}{\text{length of all unitigs added}} \tag{1}$$

$$\text{Recall} = \frac{\text{length of unitigs correctly added}}{\text{length of genome} - \text{length of binned unitigs}} \tag{2}$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

We calculated the purity and completeness of each MAG before and after refinement as:

$$\text{Purity} = \frac{\text{length of unitigs correctly binned}}{\text{length of MAG}} \tag{4}$$

$$\text{Completeness} = \frac{\text{length of unitigs correctly binned}}{\text{length of source genome}} \tag{5}$$
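Equations (1) to (5) can be written as two short helper functions. The function and argument names below are hypothetical; all lengths are in base pairs, and the zero-division guard in the F1 score is an added convention, not part of the definitions above.

```python
def precision_recall_f1(correct_added, all_added, genome_len, binned_len):
    """Eqs. (1)-(3): evaluate the newly added unitigs of one MAG."""
    precision = correct_added / all_added
    recall = correct_added / (genome_len - binned_len)
    if precision + recall == 0:
        return precision, recall, 0.0  # convention: F1 is 0 when both are 0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def purity_completeness(correct_binned, mag_len, source_genome_len):
    """Eqs. (4)-(5): evaluate one MAG before or after refinement."""
    purity = correct_binned / mag_len
    completeness = correct_binned / source_genome_len
    return purity, completeness
```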

In GD02, since the real composition is unknown, we only compared the length, completeness and contamination predicted by CheckM [40] for each MAG.

We used barrnap (version 0.9) [41] to estimate the number of rRNA in each MAG and GTDB-Tk (version 1.5.0 with database R202) to classify MAGs.

2) Evaluation of newly identified unitig clusters.

For each newly identified unitig cluster, the total length of the unitigs belonging to each genome was calculated separately, and the cluster was assigned to the genome with the greatest total length. Then, we calculated each unitig cluster’s purity as:

$$\text{Unitig cluster's purity} = \frac{\text{length of unitigs correctly binned}}{\text{length of source genome}} \tag{6}$$

To evaluate whether a cluster is truly from a mixed genome, we used dRep [42] to cluster all genomes in all samples and divided them into groups based on 99% ANI similarity (secondary cluster level) or 90% ANI similarity (primary cluster level). If the assigned source genome of a unitig cluster belonged to the same primary or secondary cluster as the MAG’s genome, the cluster was deemed to be separated from the mixed genome. Otherwise, the cluster was considered to be from a new genome.
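Because genomes in the same secondary cluster necessarily share a primary cluster, this check reduces to comparing primary cluster identifiers. The sketch below assumes genome identifiers of the form `primary_secondary` (for example `4_2`), as used in Table 4; the function names are hypothetical.

```python
def primary_cluster(genome_id):
    """Return the 90% ANI (primary) cluster part of an id such as '4_2'."""
    return genome_id.split("_")[0]


def expected_relation(cluster_genome, mag_genome):
    """'Separated' if both genomes share a primary cluster, else 'New'."""
    same = primary_cluster(cluster_genome) == primary_cluster(mag_genome)
    return "Separated" if same else "New"


def relation_is_correct(cluster_genome, mag_genome, identified):
    """Compare the relation reported by the tool with the expected one."""
    return expected_relation(cluster_genome, mag_genome) == identified
```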

For the GD02 data, we evaluated the length of newly identified clusters and the number of coding sequences with Prokka (version 1.14.6) [43].

2.4. Computational requirements

All computations were completed on a computer equipped with an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60 GHz and 500 GB of RAM. We recorded the running time and memory used by GraphBin2, METAMVGL and UGMAGrefiner on the three datasets.

3. Results

3.1. UGMAGrefiner could conservatively improve MAG’s completeness

3.1.1. Simdata and CAMI data

For Simdata, the mean recall of UGMAGrefiner’s newly added unitigs was 53.4%, which was 7.9% higher than that of METAMVGL but 14% lower than that of GraphBin2 (Fig. 2). For the more complex CAMI data, UGMAGrefiner was relatively conservative: its recall was 8% lower than that of GraphBin2 and 15.2% lower than that of METAMVGL. However, compared with METAMVGL and GraphBin2, UGMAGrefiner improved the precision of added sequences by 28.4% and 45.7%, respectively, in Simdata, and by 30.4% and 36.8%, respectively, in CAMI data.

After the binning steps by MetaWRAP, the original MAGs obtained had high purity. Although GraphBin2 and METAMVGL are designed to improve the completeness of MAGs at species level, they reduced the purity and completeness of MAGs at genome level. UGMAGrefiner could increase completeness with a small reduction in purity (Table 1).

Table 1.

The mean purity and completeness of MAGs in Simdata and CAMI data.

Tool	Simdata purity (%)	Simdata completeness (%)	CAMI data purity (%)	CAMI data completeness (%)
Original	99.1	95.4	97.6	90.3
GraphBin2	70.8	78.8	79.3	83.4
METAMVGL	83.8	77.3	73.7	63.7
UGMAGrefiner	98.7	98.0	92.6	97.0

3.1.2. GD02 data

For the GD02 data, GraphBin2 produced the longest MAGs after refinement (Table 2) and displayed a good ability to increase the length of MAGs, whereas the performance of METAMVGL was unstable. The length increase achieved by UGMAGrefiner ranged from 4% to 18%.

Table 2.

The total length of MAGs before and after the binning refinement on GD02 data.

Sample	Before refinement (bp)	GraphBin2 (bp)	METAMVGL (bp)	UGMAGrefiner (bp)
S1	58100828	114859733	59680926	68772487
S2	55627138	94993561	51267811	62092911
S3	38814191	86329493	48990783	44542434
S4	71915370	120131523	54609593	81786642
S5	51830469	83460833	55045388	59559501
S6	38504349	43650510	31778568	40136708
S7	35113991	53914081	17154849	39987503
Length relative to before refinement (mean ± SD)	-	169 ± 32%	90 ± 23%	113 ± 4%
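The summary row of Table 2 can be recomputed from the per-sample totals. The sketch below does so for the UGMAGrefiner column; using the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

# Total MAG length per sample (bp), copied from Table 2 (S1 to S7).
before = [58100828, 55627138, 38814191, 71915370, 51830469, 38504349, 35113991]
after = [68772487, 62092911, 44542434, 81786642, 59559501, 40136708, 39987503]

# Length after refinement as a percentage of the length before refinement.
ratios_pct = [100 * a / b for a, b in zip(after, before)]

mean_pct = mean(ratios_pct)
sd_pct = stdev(ratios_pct)  # sample SD (assumption)
min_gain = min(ratios_pct) - 100
max_gain = max(ratios_pct) - 100

print(f"mean ± SD: {mean_pct:.0f} ± {sd_pct:.0f}%")
print(f"per-sample increase: {min_gain:.0f}% to {max_gain:.0f}%")
```

With these inputs the rounded values agree with the 113 ± 4% in the table and with the stated per-sample increase of 4% to 18%.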

UGMAGrefiner could increase the length of MAGs to improve completeness with a much lower increase in contamination (Table 3 and Fig. S1). The mean completeness of MAGs refined with UGMAGrefiner was higher than that of the original MAGs and of those refined with GraphBin2 and METAMVGL. Although contamination after UGMAGrefiner was higher than before refinement, it was lower than that after METAMVGL or GraphBin2.

Table 3.

The completeness, contamination, and length of MAGs before and after the refinement on GD02 data.

Refinement tool	Completeness (%)	Length (Million bp)	Contamination (%)
Original	94.2 ± 6.8	2.6 ± 0.6	1.2 ± 1.4
GraphBin2	91.6 ± 12.8	4.4 ± 3.2	45.5 ± 88.6
METAMVGL	57.2 ± 32.2	2.4 ± 2.7	12.6 ± 41.7
UGMAGrefiner	95.6 ± 6.0	2.9 ± 0.7	6.7 ± 12.2

In summary, UGMAGrefiner can improve the completeness of MAGs with a small increase in contamination, which is always better than the other two state-of-the-art tools in the three datasets (Fig. S1). Meanwhile, GTDB-Tk analysis indicated that this increased contamination did not affect the taxonomic classification of MAGs in the datasets.

3.2. Evaluation on newly identified unitig clusters

3.2.1. Simdata and CAMI data

Genome abundance and the similarity among genomes within a microbial population are two important factors affecting the quality of metagenomic binning. To better characterize the performance of UGMAGrefiner on the Simdata and CAMI datasets, we divided the genomes of each sample into four groups (HC, HU, LC, and LU) based on their abundances and inter-similarities. HC (high common) indicates genomes with relative abundance > 1% and with similar genome(s) in the sample (ANI > 90%), HU (high unique) indicates genomes with relative abundance > 1% and without a similar genome in the sample, LC (low common) indicates genomes with relative abundance < 1% and with similar genome(s) in the sample, and LU (low unique) indicates genomes with relative abundance < 1% and without a similar genome in the sample. We chose a relative abundance threshold of 1% to divide genomes into groups because this corresponds to ∼10x sequencing depth in Simdata and CAMI data, and it has been reported that de novo assembly can hardly reconstruct genomes below 10x sequencing depth [33]. As shown in Fig. 3, for Simdata and CAMI data, most HU genomes could be de novo binned, while only part of the HC genomes could be de novo binned. Low-abundance genomes were more likely to be missed in de novo assembly, which is consistent with previous reports [33]. UGMAGrefiner could identify genomes mainly from the HC group.
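The grouping rule can be expressed compactly as follows. The function name and its arguments are hypothetical, and because the text does not say how a genome at exactly 1% relative abundance is handled, this sketch places it in the low-abundance groups.

```python
def genome_group(relative_abundance_pct, max_ani_to_others_pct):
    """Return 'HC', 'HU', 'LC' or 'LU' for one genome in one sample.

    relative_abundance_pct: relative abundance of the genome (%)
    max_ani_to_others_pct:  highest ANI (%) to any other genome in the
                            sample, or None if no comparison is available
    """
    level = "H" if relative_abundance_pct > 1.0 else "L"
    has_similar = max_ani_to_others_pct is not None and max_ani_to_others_pct > 90.0
    return level + ("C" if has_similar else "U")
```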

Fig. 3. **The status of each genome after refinement in (A) Simdata and (B) CAMI data.** HC indicates genomes with relative abundance > 1% and with similar genome(s) in the sample (ANI > 90%), HU indicates genomes with relative abundance > 1% and without a similar genome in the sample, LC indicates genomes with relative abundance < 1% and with similar genome(s) in the sample, and LU indicates genomes with relative abundance < 1% and without a similar genome in the sample. The Y-axis represents the number of genomes in each group. Different colors represent the status of each genome. “De novo Binned” represents genomes binned with de novo assembly, “Newly identified” represents low-abundance sequence clusters generated from unbinned sequences with UGMAGrefiner, and “Missed” represents genomes still not found after de novo assembly and UGMAGrefiner’s refinement.

In Simdata we identified 10 clusters from mixed similar genomes of the same species with above 95% similarity and 2 clusters from new genomes (Table 4). The average purity of the clusters was 90%, and all identified clusters were correctly classified as “Separated” or “New”. In CAMI data, we obtained 13 newly identified clusters with an average purity of 78%, and 9 of these clusters were correctly classified as “Separated” or “New”. The correctly classified “Separated” clusters were all from the same species, with above 95% similarity to their corresponding source MAG. Two clusters longer than 1 Mbp were falsely classified as “New”. This might be because these two clusters have below 95% similarity with their corresponding source MAG.

Table 4.

Evaluation of newly identified unitig clusters by UGMAGrefiner on Simdata and CAMI data.

Sample	Unitig cluster’s genome^a	Source MAG’s genome^a	Unitig cluster’s length (bp)	Extend rate	Unitig cluster’s purity (%)	Identified relation to source genome	T / F of identified relation
Simdata
S1	4_3	4_2	441523	0.04	49	Separated^b	T
S1	4_1	4_2	103501	0.25	92	Separated	T
S2	4_1	4_1	113658	0.04	99	Separated	T
S2	4_2	4_1	138085	0.09	88	Separated	T
S3	4_2	4_1	160057	0.05	87	Separated	T
S4	4_2	4_1	124078	0.05	92	Separated	T
S4	4_1	4_1	119861	0.11	98	Separated	T
S5	4_2	4_1	253029	0.11	78	Separated	T
S5	6_0	5_0	1549187	0.01	99	New^c	T
S6	4_2	4_1	190254	0.08	99	Separated	T
S6	6_0	5_0	1578234	0.00	100	New	T
S7	4_2	4_1	240035	0.02	99	Separated	T
CAMI
S1	47_1	46_1	136866	1.27	96	New	T
S4	47_1	46_1	1139558	0.07	99	New	T
S5	126_1	126_2	104259	0.00	100	Separated	T
S6	74_1	74_1	557558	0.02	73	Separated	T
S7	40_12	40_1	771803	0.14	65	Separated	T
S7	47_2	47_1	1234105	0.09	68	New	F
S8	51_1	46_1	235633	0.45	68	Separated	F
S8	40_4	40_4	304833	0.31	35	Separated	T
S8	47_1	47_2	1095395	0.06	79	New	F
S9	40_1	40_1	126727	0.50	77	Separated	T
S9	42_1	42_3	399230	0.00	97	Separated	T
S10	130_1	131_4	1500061	0.02	75	New	T
S10	49_0	46_1	468027	0.47	93	Separated	F

^a Unitig cluster’s genome indicates the genome to which the newly identified unitig cluster was assigned, given as its 99% ANI cluster among the genomes as generated by dRep. Source MAG’s genome indicates the genome from which the majority of the unitigs in the MAG were derived, also given as its 99% ANI similarity cluster as generated by dRep. The number before “_” indicates the 90% similarity cluster of the genome. ^b “Separated” means the identified cluster is from a mixed similar genome. ^c “New” means the identified cluster is from a new unbinned genome.

Fig. 4 shows the status of the high-abundance (>1% relative abundance) 99% similarity genome clusters of each sample from Simdata and CAMI data after treatment by UGMAGrefiner. In Simdata, de novo assembly could obtain a MAG representing the most abundant genome in primary cluster 4 (primary cluster: a group of genomes with more than 90% ANI similarity), but missed the other genomes in different secondary clusters (secondary cluster: a group of genomes with more than 99% ANI similarity). UGMAGrefiner could identify unique unitigs from other secondary clusters or even from different genomes in the same secondary cluster. Only one secondary cluster, which had a lower relative abundance, could not be identified from the mixed MAGs in S2. In CAMI data, there were 15 mixed primary clusters in total, 4 of which could not be de novo assembled. UGMAGrefiner could identify 8 unique unitig clusters from the 11 MAGs that contained mixed similar genomes. UGMAGrefiner could not identify the other 3 mixed primary clusters because they had low abundance, or relatively low abundance compared with their source MAG, such as “74_1” in S1, “74_3” in S2, and “74_1” in S4. This is because we discarded clusters with coverage below 10, and because metaSPAdes, which we used in the de novo assembly step, disconnects unitigs in the assembly graph when the coverage of one unitig (“A”) is 10 times higher than that of another (“B”). Among the 8 identified unitig clusters, 3 (“74_2” in S6, “40_6” in S8 and “40_11” in S9) were mainly from the same 99% similarity cluster rather than from a similar genome. This indicates that UGMAGrefiner could not separate similar genomes well when their abundances are similar.

3.2.2. GD02 data

UGMAGrefiner identified 1–6 clusters from mixed similar genomes or new genomes in every sample except one (S6, see Table 5 for details). The newly identified clusters contained numerous new CDS (coding sequences) for further functional analysis and were helpful for distinguishing similar genomes mixed during de novo assembly. The detailed status of each newly identified cluster is listed in Supplementary Material Table S2.

Table 5.

The number of de novo MAGs, newly identified clusters and CDS in GD02.

Sample	Number of de novo MAGs	Number of “Separated” clusters	Number of “New” clusters	Number of CDS in identified clusters
S1	21	4	2	3077
S2	21	2	2	2876
S3	15	3	0	866
S4	27	3	0	834
S5	21	3	0	553
S6	16	0	0	0
S7	14	1	0	241

3.3. Computational requirements

Table 6 shows the computational requirements of GraphBin2, METAMVGL and UGMAGrefiner on the three datasets. Both UGMAGrefiner and GraphBin2 can use multiple CPUs to accelerate the computation. METAMVGL needs to generate assembly and paired-end graphs, which may require more memory than UGMAGrefiner and GraphBin2, but its multi-view graph-based metagenomic contig binning step is much faster than UGMAGrefiner and GraphBin2. Overall, our tool has comparable computational requirements for metagenomic analysis.

Table 6.

Computational requirements.

Dataset	Metric	GraphBin2	METAMVGL	UGMAGrefiner
Simdata	Number of CPUs	8	1	8
Simdata	Time (s)	4	118	3
Simdata	Memory used (MB)	13	~8000	40
CAMI	Number of CPUs	16	1	16
CAMI	Time (s)	592	323	154
CAMI	Memory used (MB)	268	~12000	408
GD02	Number of CPUs	16	1	16
GD02	Time (s)	1471	814	127
GD02	Memory used (MB)	245	~27800	351

Only the steps after obtaining the binning result are counted.

4. Discussion

In this study, we developed a new pipeline, UGMAGrefiner, which uses the connection and coverage information of unitig-level assembly graphs to improve the binning result of de novo assembly. UGMAGrefiner conservatively enhances the completeness of MAGs, at the cost of a small increase in contamination, by retrieving some of the unitigs that are generally discarded by popular metagenomic binning tools and by inferring unitigs shared by multiple MAGs. Furthermore, UGMAGrefiner can identify genome-specific unitigs from genomes of the same species that are mixed in one MAG, enabling the study of the unique functions of those genomes.

In this experiment, we set two ANI similarity levels, 90% and 99%, to evaluate UGMAGrefiner’s ability to distinguish genomes. Although 95% ANI is usually used to distinguish different species [44], we found that some genomes with below 95% ANI could not be distinguished by de novo assembly. For instance, 47_1 and 47_2 have 93% ANI but were still mixed in the MAGs of samples S7 and S8, whereas UGMAGrefiner could distinguish these genomes well. Furthermore, for more similar genomes, such as 4_1, 4_2 and 4_3 in Simdata, which belong to the same species with above 95% ANI, UGMAGrefiner could identify them from one mixed MAG. UGMAGrefiner was not good enough at distinguishing genomes with above 99% ANI (such as the genomes mixed within 4_1 in Simdata), which might be caused by limitations of the assembler, as genomes with above 99% ANI are hard to distinguish through de novo assembly [26], [31]. However, for complex metagenomic samples such as environmental metagenomes, closely related MAGs (ANI > 95%) are often not well recovered by current de novo assembly methods [33]. Thus, the limited resolution of UGMAGrefiner should not be a problem for such samples.

Repeats and conserved elements are well known to be difficult to assemble and bin [45]. In this work, we used barrnap to estimate the number of rRNA genes in each MAG and source genome in the three datasets. We found that UGMAGrefiner could indeed add some rRNA genes to their original MAGs, but the accuracy of these additions was dataset dependent: it was 92.5% in Simdata but only 50% in CAMI data. Furthermore, for all datasets, the numbers of rRNA genes obtained were still much lower than the real numbers in the source genomes (Table S3). Although UGMAGrefiner currently might not be good enough to assemble highly conserved elements such as rRNA genes from complex metagenomic mixtures, we believe that long-read sequencing or specially designed tools have the potential to solve this problem in the future.

During unlabeled unitig recruitment, UGMAGrefiner can find unitigs from some MAGs whose total length is greater than 1.5 Mbp, and it can identify some “New” clusters during the low-coverage unitig clustering step. Although these unitigs cannot be regarded as complete new genomes, they represent parts of new genomes and might be useful. On the other hand, because the clustering of unitigs for each MAG is based on unitig coverage, it is hard for UGMAGrefiner to distinguish unitigs from similar genomes whose abundances are similar, so the unique sequences from these genomes might still be mixed in one cluster. This limitation might be overcome by using information from multiple samples. Furthermore, UGMAGrefiner might be limited in identifying genome-specific regions of genomes with low abundance, or low abundance relative to their corresponding MAGs, for two reasons. One is that it discards clusters with coverage below 10 to ensure the correctness of clusters. The other is a limitation of metaSPAdes: when generating the assembly graph, it removes sequencing errors, handles repeat regions and masks strain variation, and it disconnects unitigs in the assembly graph if the coverage of a unitig “A” is 10 times higher than that of a unitig “B”.
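The two abundance-related limits described above can be written as a pair of simple checks. This is a minimal sketch under stated assumptions: the function names are hypothetical, a cluster's coverage is taken as a single number supplied by the caller, and the boundary behaviour (keeping a cluster at exactly 10, keeping a link at a ratio of exactly 10) is a guess, since the text does not specify it. It is not the code of UGMAGrefiner or metaSPAdes.

```python
MIN_CLUSTER_COVERAGE = 10.0  # clusters below this coverage are discarded (from the text)
MAX_COVERAGE_RATIO = 10.0    # coverage ratio beyond which unitigs are described as disconnected


def cluster_is_kept(cluster_coverage):
    """Coverage filter: keep a cluster only if its coverage is at least 10."""
    return cluster_coverage >= MIN_CLUSTER_COVERAGE


def link_likely_kept(cov_a, cov_b):
    """Ratio rule: a link between two unitigs is assumed to survive only if
    the higher coverage is at most 10 times the lower one."""
    high, low = max(cov_a, cov_b), min(cov_a, cov_b)
    if low <= 0:
        return False
    return high / low <= MAX_COVERAGE_RATIO


def is_recoverable(mag_coverage, region_coverage):
    """A genome-specific region is a candidate for recovery only if it
    passes both the coverage filter and the ratio rule."""
    return cluster_is_kept(region_coverage) and link_likely_kept(
        mag_coverage, region_coverage
    )


print(is_recoverable(100.0, 15.0))  # passes both checks
print(is_recoverable(200.0, 15.0))  # ratio above 10: link assumed lost
print(is_recoverable(50.0, 8.0))    # coverage below 10: cluster discarded
```

The second and third calls illustrate the two failure modes: a region that is abundant enough in absolute terms but too rare relative to its source MAG, and a region whose coverage is simply below the cut-off.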

Using assembly graphs with more accurate linkages or longer sequences, such as those from Hi-C or long-read assembly [46], [47], might improve the resolution and yield more complete MAGs. Because UGMAGrefiner recruits unbinned unitigs to MAGs after de novo assembly and never deletes unitigs from the originally obtained MAGs, its refinement is easily affected by low-quality MAGs. For instance, if the original MAG contains sequences from multiple genomes, UGMAGrefiner might also recruit many unitigs from other genomes, leading to lower precision and higher contamination. Therefore, before using UGMAGrefiner, we recommend a binning refinement step with MetaWRAP or DAS Tool to generate MAGs with contamination < 10% and completeness > 70%.
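The recommended input quality can be applied as a simple filter over per-MAG completeness and contamination estimates (for example, values estimated by a tool such as CheckM). The sketch below assumes these estimates are already available as percentages; the function name, bin names and values are hypothetical.

```python
def passes_quality(completeness, contamination,
                   min_completeness=70.0, max_contamination=10.0):
    """Return True for MAGs with completeness > 70% and contamination < 10%."""
    return completeness > min_completeness and contamination < max_contamination


# Hypothetical MAGs: name -> (completeness %, contamination %).
mags = {
    "bin.1": (95.2, 1.3),
    "bin.2": (68.0, 2.0),   # too incomplete
    "bin.3": (88.4, 12.5),  # too contaminated
}

selected = sorted(
    name for name, (comp, cont) in mags.items() if passes_quality(comp, cont)
)
print(selected)
```

Only the MAGs in `selected` would then be passed on for refinement, which limits the risk of recruiting unitigs into bins that already mix several genomes.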

In this study, to evaluate the performance of UGMAGrefiner, we selected two state-of-the-art assembly graph-based binning refinement tools, GraphBin2 and METAMVGL, as comparative methods, and all refinements started after the binning result was obtained. Surprisingly, both methods reduced the purity and completeness of MAGs at the genome level on the two simulated datasets. This may be attributed to the following factors: (1) The input binning result generated by metaWRAP had already filtered out some MAGs by the completeness and contamination criteria. Those filtered MAGs produce more unlabeled nodes in the assembly graphs, which might increase false recruitment. (2) Although GraphBin2 and METAMVGL perform well at the species level [29], [30], at the genome level they might incorrectly transfer some sequences from one MAG to another, so that some MAGs contain many sequences from multiple genomes whereas other MAGs have few sequences left. These factors may lead to lower purity and completeness in MAG refinement.

5. Conclusions

UGMAGrefiner, developed in this study, is a unitig-level assembly graph-based metagenome-assembled genome refiner. It can conservatively increase the completeness of MAGs with a small increase in contamination and identify specific unitigs from similar genomes of the same species (>95% ANI) that are mixed in one MAG. UGMAGrefiner has the potential to be widely used after de novo assembly to improve the completeness of MAGs, obtain better functional annotation, and improve the resolution of MAGs. It is helpful for studying the special functions carried by genome-specific sequences, which are often missed in de novo assembly.

Ethics approval and consent to participate

Not applicable.

Funding

This work was supported by Medicine and Engineering Interdisciplinary Research Fund of Shanghai Jiao Tong University, Grant/Award Number: YG2021QN29.

CRediT authorship contribution statement

M.Z., L.Z. and B.X. conceived the project, B.X. implemented the algorithm and wrote the manuscript, M.Z. wrote and revised the manuscript. All authors reviewed the manuscript. All authors read and approved the final manuscript.

Conflict of interest

The authors declare that no conflict of interest exists.

Acknowledgements

Not applicable.

Consent for publication

Not applicable.

Footnotes

Appendix A

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2023.03.030.

References

1. Quince C., Walker A.W., Simpson J.T., Loman N.J., Segata N. Shotgun metagenomics, from sampling to analysis. Nat Biotechnol. 2017;35(9):833–844. doi: 10.1038/nbt.3935.
2. Wood D.E., Salzberg S.L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3). doi: 10.1186/gb-2014-15-3-r46.
3. Wood D.E., Lu J., Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20(1):257. doi: 10.1186/s13059-019-1891-0.
4. Lu J., Rincon N., Wood D.E., Breitwieser F.P., Pockrandt C., et al. Metagenome analysis using the Kraken software suite. Nat Protoc. 2022. doi: 10.1038/s41596-022-00738-y.
5. Beghini F., McIver L.J., Blanco-Miguez A., Dubois L., Asnicar F., et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife. 2021;10. doi: 10.7554/eLife.65088.
6. Kashaf S.S., Almeida A., Segre J.A., Finn R.D. Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data. Nat Protoc. 2021;16(5):2520–2541. doi: 10.1038/s41596-021-00508-2.
7. Danko D., Bezdan D., Afshin E.E., Ahsanuddin S., Bhattacharya C., et al. A global metagenomic map of urban microbiomes and antimicrobial resistance. Cell. 2021;184(13):3376–3393. doi: 10.1016/j.cell.2021.05.002.
8. Chen C., Zhou Y., Fu H., Xiong X., Fang S., et al. Expanded catalog of microbial genes and metagenome-assembled genomes from the pig gut microbiome. Nat Commun. 2021;12(1):1106. doi: 10.1038/s41467-021-21295-0.
9. Almeida A., Nayfach S., Boland M., Strozzi F., Beracochea M., et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021;39(1):105–114. doi: 10.1038/s41587-020-0603-3.
10. Kang D.R., Huang Y.H., Nesme J., Herschend J., Jacquiod S., et al. Metagenomic analysis of a keratin-degrading bacterial consortium provides insight into the keratinolytic mechanisms. Sci Total Environ. 2021;761. doi: 10.1016/j.scitotenv.2020.143281.
11. Van Rossum T., Ferretti P., Maistrenko O.M., Bork P. Diversity within species: interpreting strains in microbiomes. Nat Rev Microbiol. 2020;18(9):491–506. doi: 10.1038/s41579-020-0368-1.
12. Alneberg J., Bjarnason B.S., de Bruijn I., Schirmer M., Quick J., et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11(11):1144–1146. doi: 10.1038/nmeth.3103.
13. Wu Y.W., Simmons B.A., Singer S.W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32(4):605–607. doi: 10.1093/bioinformatics/btv638.
14. Kang D.D., Li F., Kirton E., Thomas A., Egan R., et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019;7. doi: 10.7717/peerj.7359.
15. Liu C.C., Dong S.S., Chen J.B., Wang C., Ning P., et al. MetaDecoder: a novel method for clustering metagenomic contigs. Microbiome. 2022;10(1). doi: 10.1186/s40168-022-01237-8.
16. Nissen J.N., Johansen J., Allesoe R.L., Sonderby C.K., Armenteros J.J.A., et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat Biotechnol. 2021. doi: 10.1038/s41587-020-00777-4.
17. Jiang Z.J., Li X.B., Guo L.J. MetaCRS: unsupervised clustering of contigs with the recursive strategy of reducing metagenomic dataset's complexity. BMC Bioinformatics. 2022;22(Suppl 12). doi: 10.1186/s12859-021-04227-z.
18. Wang Z.Y., Wang Z.Y., Lu Y.Y., Sun F.Z., Zhu S.F. SolidBin: improving metagenome binning with semi-supervised normalized cut. Bioinformatics. 2019;35(21):4229–4238. doi: 10.1093/bioinformatics/btz253.
19. Kieft K., Adams A., Salamzade R., Kalan L., Anantharaman K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Res. 2022. doi: 10.1093/nar/gkac341.
20. Pan S.J., Zhu C.K., Zhao X.M., Coelho L.P. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat Commun. 2022;13(1). doi: 10.1038/s41467-022-29843-y.
21. van Dijk L.R., Walker B.J., Straub T.J., Worby C.J., Grote A., et al. StrainGE: a toolkit to track and characterize low-abundance strains in complex microbial communities. Genome Biol. 2022;23(1):0. doi: 10.1186/s13059-022-02630-.
22. Song W.Z., Thomas T. Binning_refiner: improving genome bins through the combination of different binning programs. Bioinformatics. 2017;33(12):1873–1875. doi: 10.1093/bioinformatics/btx086.
23. Sieber C.M.K., Probst A.J., Sharrar A., Thomas B.C., Hess M., et al. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat Microbiol. 2018;3(7):836–843. doi: 10.1038/s41564-018-0171-1.
24. Uritskiy G.V., DiRuggiero J., Taylor J. MetaWRAP-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018;6(1):158. doi: 10.1186/s40168-018-0541-1.
25. Li D., Luo R., Liu C.M., Leung C.M., Ting H.F., et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods. 2016;102:3–11. doi: 10.1016/j.ymeth.2016.02.020.
26. Nurk S., Meleshko D., Korobeynikov A., Pevzner P.A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27(5):824–834. doi: 10.1101/gr.213959.116.
27. Dvorkina T., Bankevich A., Sorokin A., Yang F., Adu-Oppong B., et al. ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs. Microbiome. 2021;9(1). doi: 10.1186/s40168-021-01092-z.
28. Mallawaarachchi V., Wickramarachchi A., Lin Y. GraphBin: refined binning of metagenomic contigs using assembly graphs. Bioinformatics. 2020;36(11):3307–3313. doi: 10.1093/bioinformatics/btaa180.
29. Mallawaarachchi V.G., Wickramarachchi A.S., Lin Y. Improving metagenomic binning results with overlapped bins using assembly graphs. Algorithms Mol Biol. 2021;16(1):6. doi: 10.1186/s13015-021-00185-.
30. Zhang Z.M., Zhang L. METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs. BMC Bioinformatics. 2021;22(Suppl 10). doi: 10.1186/s12859-021-04284-4.
31. Quince C., Nurk S., Raguideau S., James R., Soyer O.S., et al. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol. 2021;22.
32. Gourle H., Karlsson-Lindsjo O., Hayer J., Bongcam-Rudloff E. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics. 2019;35(3):521–522. doi: 10.1093/bioinformatics/bty630.
33. Meyer F., Fritz A., Deng Z.L., Koslicki D., Lesker T.R., et al. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat Methods. 2022. doi: 10.1038/s41592-022-01431-4.
34. Chen S.F., Zhou Y.Q., Chen Y.R., Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):884–890. doi: 10.1093/bioinformatics/bty560.
35. Wu Y.W., Tang Y.H., Tringe S.G., Simmons B.A., Singer S.W. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2. doi: 10.1186/2049-2618-2-26.
36. Zhang C., Yin A., Li H., Wang R., Wu G., et al. Dietary modulation of gut microbiota contributes to alleviation of both genetic and simple obesity in children. EBioMedicine. 2015;2(8):968–984. doi: 10.1016/j.ebiom.2015.07.007.
37. Bolger A.M., Lohse M., Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–2120. doi: 10.1093/bioinformatics/btu170.
38. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–U54. doi: 10.1038/Nmeth.1923.
39. Marcais G., Delcher A.L., Phillippy A.M., Coston R., Salzberg S.L., et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol. 2018;14(1). doi: 10.1371/journal.pcbi.1005944.
40. Parks D.H., Imelfort M., Skennerton C.T., Hugenholtz P., Tyson G.W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25(7):1043–1055. doi: 10.1101/gr.186072.114.
41. Seemann T. barrnap 0.9: rapid ribosomal RNA prediction. Available from: https://github.com/tseemann/barrnap. Accessed 2023 May 8.
42. Olm M.R., Brown C.T., Brooks B., Banfield J.F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 2017;11(12):2864–2868. doi: 10.1038/ismej.2017.126.
43. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30(14):2068–2069. doi: 10.1093/bioinformatics/btu153.
44. Olm M.R., Crits-Christoph A., Diamond S., Lavy A., Carnevali P.B.M., et al. Consistent metagenome-derived metrics verify and delineate bacterial species boundaries. mSystems. 2020;5(1). doi: 10.1128/mSystems.00731-19.
45. Gruber-Vodicka H.R., Seah B.K.B., Pruesse E. phyloFlash: rapid small-subunit rRNA profiling and targeted assembly from metagenomes. mSystems. 2020;5(5). doi: 10.1128/mSystems.00920-20.
46. DeMaere M.Z., Darling A.E. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol. 2019;20. doi: 10.1186/s13059-019-1643-1.
47. Du Y.X., Sun F.Z. HiCBin: binning metagenomic contigs and recovering metagenome-assembled genomes using Hi-C contact maps. Genome Biol. 2022;23(1). doi: 10.1186/s13059-022-02626-w.
