Table 3.
Introduction to software for Illumina and nanopore‐integrated metagenomics
| Bioinformatic category of tools | Name of tool | Description | Reference |
|---|---|---|---|
| Nanopore‐alone assembly | Canu | Canu is a fork of the Celera Assembler, designed for noisy long reads produced by PacBio or nanopore sequencing. Canu assembles LRs in three hierarchical steps: correct, trim, and assemble. An adaptive overlapping strategy is applied to improve genome recovery efficiency. | [53] |
| | metaFlye | De novo assembler for nanopore LRs, specifically designed to address important challenges of LR metagenomic assembly. Uneven bacterial composition is addressed by a metagenome k‐mer selection mode, in which genomic k‐mers are selected using a per‐read frequency threshold estimated from the error probability rather than a uniform coverage threshold. Intraspecies (strain‐level) heterogeneity is resolved by iteratively identifying repetitive edges based on the read paths in the repeat graph. | [16] |
| | Miniasm | Miniasm is a very fast overlap‐layout‐consensus (OLC)‐based de novo assembler for noisy nanopore LRs. It takes all‐versus‐all LR self‐mappings as input and generates an assembly graph in GFA format. Unlike mainstream assemblers, Miniasm has no consensus step: it simply concatenates pieces of read sequences to generate the final contigs, so the per‐base error rate of the contigs is similar to that of the raw input LRs. It is not specifically optimized for metagenome assembly, so only the very dominant populations within a community can be assembled. | [52] |
| | Wtdbg2 | De novo assembler for noisy PacBio and nanopore LRs. It assembles raw LRs without error correction and then builds the consensus from the intermediate assembly output. Wtdbg2 chops reads into 1024 bp segments, merges similar segments into a vertex, and connects vertices based on segment adjacency on the reads. The result is a fuzzy Bruijn graph (FBG), which is akin to a De Bruijn graph but permits mismatches/gaps and keeps read paths when collapsing k‐mers. It can assemble large genomes 10 times faster than Canu, but it is not specifically optimized for metagenome assembly, so usually only the very dominant populations can be assembled. | [76] |
| Hybrid‐assembly | MetaSPAdes | MetaSPAdes is a de novo assembler capable of hybrid assembly of Illumina SRs and nanopore LRs with the classic SPAdes algorithm. Nanopore LRs are used to simplify the SR‐constructed De Bruijn graph by closing gaps and resolving repeats. MetaSPAdes does not correct errors in the nanopore LRs. Postcorrected nanopore LRs can simply be provided to SPAdes as single long reads. | [48] |
| | Unicycler | Unicycler is a de novo assembler designed to optimize the hybrid assembly of Illumina SRs and nanopore LRs for bacterial isolates. To simplify the graph and produce longer contigs, nanopore LRs are semiglobally aligned to the assembly graph that SPAdes constructs from the SRs. If only nanopore LRs are provided as input, it runs a miniasm + Racon pipeline. | [49] |
| LRs‐correction | Medaka | Medaka is a tool to create consensus sequences and variant calls from nanopore sequencing data. It performs this task using neural networks applied to a pileup of individual sequencing reads against a draft assembly. | https://github.com/nanoporetech/medaka |
| | Racon | Racon is intended as a standalone graph‐based consensus module to correct raw contigs generated by rapid assembly of nanopore LRs. | [57] |
| SRs‐correction | Pilon | Pilon is a software tool that can be used to correct indels and single‐base errors in nanopore data sets based on BAM files of Illumina SRs aligned to nanopore LRs. | [58] |
| | Polypolish | Polypolish is a tool for polishing genome assemblies with SRs. It uses SAM files in which each read has been aligned to all possible locations (not just a single best location). This allows it to repair errors in repeat regions that other alignment‐based polishers cannot fix. | [77] |
| Frame‐shift correction | LAST + FUNpore | LAST is the first alignment tool to perform frame‐shift aware alignment when aligning nucleotide sequences against a functional gene database consisting of amino acid sequences. The adaptive seed algorithm of LAST has shown the highest sensitivity in functional gene identification on nanopore LRs [86]. FUNpore is a software toolkit that corrects frame‐shift errors by inserting Ns into the nanopore LRs to maintain the frame, based on the locations of frame‐shifts reported in the LAST alignments. | [50, 87] |
| | Diamond + MEGAN‐LR | DIAMOND is a widely used fast alignment tool originally designed for SR alignment. Since v0.9.23, DIAMOND can perform frame‐shift aware DNA‐to‐protein alignment. MEGAN‐LR is a GUI‐based software tool that can correct frame‐shift errors in nanopore LRs. It is included in the default package of the free community version of MEGAN6. | [62] |
| Alignment | LAST | LAST is an aligner that adopts an adaptive seed and fitting algorithm, which is ideal for DNA‐to‐DNA or DNA‐to‐protein alignment of error‐prone nanopore LRs. LAST has shown the highest sensitivity in functional gene identification on nanopore LRs [86]. | [63] |
| | Minimap2 | Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or nanopore reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice‐aware alignment of PacBio Iso‐Seq or nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single‐ or paired‐end reads; (5) assembly‐to‐assembly alignment; (6) full‐genome alignment between two closely related species with divergence below ~15%. | [59] |
| Metagenomic binning tools | MetaWRAP | MetaWRAP is an easy‐to‐use metagenomic wrapper suite that accomplishes the core tasks of metagenomic analysis, including binning, taxonomic profiling, and functional annotation. It extracts MAGs from metagenomic data sets by combining results from MetaBAT2, MaxBin2, and CONCOCT, and it can deliver refined and dereplicated binning results for subsequent annotation. It is particularly useful for differential binning of metagenomic data sets. | [78] |
| | MetaBAT2 | MaxBin 2.0 employs an Expectation–Maximization (EM) algorithm to recover draft genomes from metagenomes. It is the most commonly used tool when binning a single integrated metagenomic data set. | [79] |
| Phylogenetic annotation | Centrifuge | Centrifuge is a very rapid and memory‐efficient system for the classification of DNA sequences from microbial samples. The system uses a novel indexing scheme based on the Burrows‐Wheeler transform (BWT) and the Ferragina–Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (e.g., 4.3 GB for ~4100 bacterial genomes) yet provides a very fast classification speed. | [82] |
| | Kraken2 | Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k‐mers and a novel classification algorithm. Kraken's accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision. | [46] |
| | ARGpore2 | ARGpore2 is a software package in which a MEGAN‐like LCA voting algorithm is first applied to generate the taxonomic affiliation of each nanopore LR based on the annotation results of Centrifuge. The derived affiliation is then validated and improved by LAST alignment against the MetaPhlAn2 marker gene database, whose unique clade‐specific marker genes can achieve species‐level resolution for the identification of bacteria, archaea, eukaryotes, and viruses. The tool also annotates antibiotic resistance genes on nanopore LRs by LAST alignment against an nt‐version of the SARG database [88]. | [72] |
| Functional annotation | Prokka | Prokka is a tool to annotate bacterial, archaeal, and viral genomes quickly and produce standards‐compliant output files. Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labeling them with useful information. | [67] |
Abbreviations: GFA, graphical fragment assembly; LR, long read; SR, short read.
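
The frame‐shift correction idea described for FUNpore in the table (inserting Ns so that the downstream sequence stays in frame) can be illustrated with a minimal Python sketch. This is not FUNpore's actual code or interface: the function name `pad_frameshifts` and the `(position, offset)` input format are hypothetical, chosen only for illustration, and in practice the frame‐shift locations would have to be parsed from the LAST alignment output.

```python
def pad_frameshifts(read, shifts):
    """Insert Ns so that the sequence after each frame-shift is back in frame.

    read   : nucleotide string (e.g. one nanopore LR)
    shifts : list of (position, offset) pairs; a hypothetical format in which
             `position` is a 0-based index into the read and `offset` is the
             net number of extra (+) or missing (-) bases at that position.
    """
    pieces = []
    prev = 0
    for pos, offset in sorted(shifts):
        pieces.append(read[prev:pos])
        # Pad with as many Ns as needed to make the net shift a multiple of 3:
        # one missing base (-1) needs 1 N, one extra base (+1) needs 2 Ns.
        pieces.append("N" * ((-offset) % 3))
        prev = pos
    pieces.append(read[prev:])
    return "".join(pieces)


# One base is assumed missing after the second codon of this toy read.
corrected = pad_frameshifts("ATGAAACCGGGTTT", [(6, -1)])
print(corrected)
```

Padding with Ns rather than deleting bases keeps every original base of the read in place; the padded codon simply becomes ambiguous when translated.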