. 2021 Apr 2;11(4):530. doi: 10.3390/biom11040530

Table 1.

Examples of widely used tools for next-generation sequencing data analysis in gut microbiome studies.

Software	Short Description	Ref.
16S rRNA, 18S rRNA and ITS sequencing data analysis
UCLUST/ UPARSE	UCLUST is an OTU-based clustering method that employs USEARCH. UPARSE is a subroutine of USEARCH that constructs OTUs de novo from next-generation reads. The general UPARSE pipeline is read filtering, trimming, and then simultaneous clustering and chimera filtering. Pros: Able to perform de novo, closed-reference, and open-reference clustering. Cons: May filter out too many reads, leading to inaccurate estimates of the least abundant species.	[19,20]
CD-HIT	CD-HIT is one of the most widely used OTU-based clustering tools for reducing sequence redundancy and improving the performance of other analyses. Pros: Uses a novel parallelization strategy to achieve fast runtime; can handle extremely large databases. Cons: Diminished clustering accuracy.	[21]
Hc-OTU	Hc-OTU is an OTU-based clustering method for 16S rRNA sequences that employs homopolymer compaction and k-mer profiling. Pros: High accuracy. 7,000 times faster than MOTHUR and about six times faster than ESPRIT-Tree, while maintaining the same accuracy level as MOTHUR. Supports a user-specified k-mer distance threshold. Cons: Its worst-case time complexity is O(n²), whereas UCLUST and CD-HIT are faster, with a runtime of O(n^1.2).	[22]
ESPRIT	ESPRIT is an OTU-based hierarchical clustering method consisting of quality filtering, pairwise distance computation, hierarchical clustering, and estimation with statistical inference. There are two versions of ESPRIT, one for personal computers (small/medium-sized data) and one for computer clusters (large data). Pros: Able to analyze data of various sizes. Cons: Quadratic, O(n²), time and space complexity.	[23]
ESPRIT-Tree	ESPRIT-Tree is an OTU-based, online-learning-based hierarchical clustering method. ESPRIT-Tree improves on the previous ESPRIT algorithm and uses a pseudometric-based partition tree. Pros: Improved runtime over ESPRIT: O(n^1.17); relatively high accuracy. Cons: In terms of computational efficiency, UCLUST performs better than ESPRIT-Tree.	[24]
DADA2	DADA2 is an ASV-based analysis pipeline for modeling and error-correcting Illumina sequence reads. Pros: High accuracy: able to resolve single-nucleotide biological differences. Can perform species-level analysis. Runtime scales linearly as the number of samples increases, with reasonable memory requirements. Cons: Slower denoising algorithm than UPARSE.	[26]
UNOISE2	UNOISE2 is an ASV-based tool for denoising (error-correcting) Illumina sequence reads. It improves on UNOISE and clusters unique reads in the sequence. Pros: Higher accuracy and speed than DADA2. Cons: Does not use quality scores.	[27]
Deblur	Deblur is an ASV-based denoising tool that uses error profiles to obtain putative error-free sequences. It operates independently on each sample. Pros: Able to obtain single-nucleotide resolution; faster than DADA2; better memory efficiency than DADA2 and UNOISE2. Better sensitivity and specificity. Cons: Slower than UNOISE2; limited by read length and sample sequences’ diversity.	[28]
QIIME/ QIIME2	QIIME and QIIME2 are bioinformatics platforms for microbial community analysis and visualization. QIIME2 was engineered from QIIME and has replaced it. QIIME2 uses existing bioinformatics tools, such as DADA2 and Deblur, as subroutines. Pros: Has multiple interfaces; continues to grow and adapt to novel strategies. Cons: A large number of dependent programs need to be installed.	[29,30]
Mothur	Mothur is a software package that analyzes raw sequences and generates visualizations describing α and β diversity. It combines multiple analytic tools for describing and comparing microbial communities. It provides examples for data acquired from different sequencing platforms. Pros: Able to perform both ASV-based and OTU-based analysis. Cons: Relatively slow runtime and high space complexity.	[31]
PICRUSt/ PICRUSt2	PICRUSt is software for predicting functional composition based solely on marker gene sequence profiles. PICRUSt2 improves on PICRUSt with a larger reference database, enhanced prediction ability, and more accurate de novo amplicon tree-building. PICRUSt2: Pros: Able to identify novel discoveries. Can process 18S rRNA and ITS sequences, whereas the original version only supports 16S rRNA sequence analysis. Cons: Can only differentiate taxa at the same level as the amplified marker gene sequence. Can be problematic if the majority phyla of the microbial community of interest are not yet well characterized.	[35,36]
Tax4Fun/ Tax4Fun2	Tax4Fun is an R package for predicting functional profiles from 16S rRNA data on the basis of SILVA-labeled OTU abundances. Tax4Fun2 is an improved version of Tax4Fun with more accurate and enhanced prediction power. Tax4Fun2: Pros: Easy to use, platform-independent, and highly memory-efficient. Tax4Fun2 has higher accuracy than PICRUSt and Tax4Fun. Cons: Availability of suitable reference genomes may limit Tax4Fun2’s performance. Only supports prediction from the 16S rRNA gene.	[37,39]
Piphillin	Piphillin is a web application that produces metagenome predictions based on nearest-neighbor mappings of 16S rRNA sequences to genomes. Pros: No local computational power requirements. High correlation with corresponding metagenomic data. Higher accuracy than PICRUSt2. Cons: Places high demands on the reference database. Only supports 16S rRNA gene prediction.	[40]
Vikodak/ iVikodak	Vikodak is a web service that provides functional prediction on 16S rRNA data. It contains 3 modules: Global Mapper, Inter Sample Feature Analyzer, and Local Mapper. With these 3 modules, it is able to perform functional prediction both globally and in detail and perform pair-wise comparative statistical analysis. iVikodak is an improved version of Vikodak. Pros: No local computational power requirements. No coding skill required. Allows for single pathway probing and gene quorum assumption. Cons: Only supports prediction from 16S rRNA gene.	[41]
SSU-ALIGN	SSU-ALIGN is designed primarily to align 16S and 18S small subunit ribosomal RNA, but can also be used for large subunit ribosomal RNA alignment. Pros: High sensitivity and specificity. Cons: Not capable of inferring phylogenetic trees. Computationally expensive.	[65]
LotuS2	LotuS2 is a software pipeline for 16S/18S/ITS rRNA analysis. It is able to calculate denoised, chimera-checked OTUs and construct an OTU phylogenetic tree. Pros: Fast and user-friendly. Able to handle a wide variety of data sizes on a personal computer. Cons: Mapping speed limited by BLAST+.	[66]
MICCA	MICCA is command-line software for processing 16S rRNA gene and ITS amplicon sequencing data, from raw sequences to OTU tables, taxonomic classification, and phylogenetic tree inference. Pros: Can be used effectively on samples with a large proportion of uncharacterized species. Low requirements for the reference database. Memory efficient. Cons: Fewer estimated OTUs are obtained, as a compromise for high consistency.	[67]
PEMA	PEMA is a software pipeline for metabarcoding analysis based on third-party tools. Its functions include read pre-processing, OTU clustering, ASV inference, taxonomy assignment, and COI marker gene analysis. Pros: Allows partial re-execution. Fast execution time. Cons: Heavyweight computation.	[68]
ITScan	ITScan is an online pipeline for fungal diversity analysis and identification based on ITS sequences. Pros: Does not require coding skills. User friendly. Cons: Requires FASTA-formatted input file.	[69]
ITSx	ITSx is software for detecting and extracting the ITS1 and ITS2 subregions from ITS sequences of fungi and other eukaryotes. It relies on HMMER for profile hidden Markov model analysis. Pros: Has a very high proportion of true-positive extractions and a low proportion of false-positive extractions. Cons: Requires FASTA-formatted input file.	[70]
ITSxpress	ITSxpress is software for trimming the ITS1, ITS2, or entire ITS region. It implements HMMER and BBMerge. It is designed to support the calling of exact sequence variants rather than OTUs. Pros: Fast runtime. Processes FASTQ-formatted input file.	[71]
Mycofier	Mycofier is a machine-learning-based classifier of fungal ITS1 sequences at the genus level. The final model was based on ITS1 sequences from 510 fungal genera using a Naïve Bayes algorithm. Pros: Doesn’t require pairwise sequence alignment. Cons: Only analyzes fungal ITS1 sequences. A BLAST-based approach provides higher classification accuracy.	[72]
Shotgun metagenomic and metatranscriptomic sequencing data analysis
Trimmomatic	Trimmomatic is a sequence trimmer for Illumina sequence data. It has multiple processing steps, including detection and removal of adapters and other Illumina-specific sequences, and quality filtering. Pros: Processes both paired-end and single-end data. Cons: Slower than Ktrim.	[89]
Ktrim	Ktrim provides both adapter- and quality-trimming of the sequencing data. Pros: Faster than Trimmomatic. Cons: Higher over-trimming rates than Trimmomatic.	[90]
Cutadapt	Cutadapt is a sequence trimmer which removes adapter sequences, primers and other types of unwanted sequence from high-throughput sequencing reads. Pros: Supports 454, Illumina and SOLiD (color space) data. Cons: Slow runtime.	[91]
MultiQC	MultiQC creates a summary report visualizing output from different tools across multiple samples, facilitating the identification of global trends and biases. Pros: Provides a global view instead of per-sample analysis.	[92]
Bowtie2	Bowtie2 is software for aligning sequences to a reference genome. It supports gapped, local, and paired-end alignments. The software implements full-text minute index and SIMD dynamic programming. Pros: Memory efficient. High speed, sensitivity and accuracy. Cons: Alignment of short reads (<50 bp) remains an active challenge.	[95]
DIAMOND	DIAMOND is a sequence aligner for protein and translated DNA searches. It aims to determine all significant alignments for a given input. DIAMOND uses double indexing and spaced seeds. Pros: Significantly higher speed with similar sensitivity to BLASTX. Cons: High memory consumption.	[96]
BBMap	BBMap is a sequence aligner that can align DNA and RNA sequencing reads from multiple platforms, including Illumina, 454, Sanger, Ion Torrent, PacBio, and Nanopore. BBMap needs to index a reference before mapping to it. Pros: Fast and accurate, particularly for reads with long indels or highly mutated genomes. Has no upper limit on the number of contigs or genome size. Cons: The indexing phase accepts FASTA format only.	[97]
Meta-IDBA	Meta-IDBA is a de novo metagenomic assembler. It first constructs a de Bruijn graph and then divides the graph into connected components. Pros: Provides a multiple alignment of similar contigs from different subspecies in the same species. Cons: Unable to reconstruct the contigs of each single subspecies.	[105]
IDBA-UD	IDBA-UD is a de novo single-cell and metagenomic assembler that can assemble sequences with highly uneven depth. It is based on the de Bruijn graph approach. Pros: Implements local assembly. Cons: Sequences of species with high abundance are more likely to be misidentified as repeats.	[106]
MetaVelvet	MetaVelvet is a de novo short-sequence metagenome assembler. It extends the Velvet assembler (single-genome and de Bruijn graph based) to overcome the limitations of single-genome assemblers. Pros: Able to reconstruct scaffold sequences, including those of low-abundance species. Cons: Has slightly higher percentages of chimeric scaffolds.	[107]
MegaHit	MegaHit is a de novo assembler for metagenomics data. It implements succinct de Bruijn graphs. Pros: Fast and memory efficient. Available in both CPU-only and GPU-accelerated versions. Cons: Relatively biased towards the assembly of low-abundance genome fragments.	[108]
MetaQUAST	MetaQUAST evaluates and compares the quality of metagenome assemblies. It builds on QUAST. Its metagenome-specific features include an unlimited number of reference genomes, species content detection, chimera detection, and visualizations. Pros: Can be fed with multiple assemblies. Cons: Reduced precision in order to get higher time/memory efficiency.	[110]
MEGAN	MEGAN is a BLAST-based automated pipeline for taxonomic and functional analysis of metagenomic and metatranscriptomic datasets. Pros: Allows laptop analysis of large metagenomic data sets.	[111]
MetaPhlAn/ MetaPhlAn2	MetaPhlAn is an automated pipeline that profiles microbial composition from shotgun metagenomic data at the species level. The microbial communities it can profile include bacteria, archaea, eukaryotes and viruses. It accomplishes profiling with unique clade-specific marker genes. MetaPhlAn2 extends the first version with enhanced metagenomic taxonomic profiling ability. Pros: Able to work with large-scale metagenome data.	[112]
HUMAnN2	HUMAnN2 is an automated pipeline designed for functional analysis of metagenomic and metatranscriptomic data at the species level. The general process of the HUMAnN2 pipeline is identification of known species, alignment of reads to pangenomes, translated search on unclassified reads, and quantification of gene families and pathways. HUMAnN2 uses other pipelines, such as MetaPhlAn2, to identify known species. Pros: High accuracy, sensitivity, speed. Cons: A large proportion of sequencing reads remain unmapped and unintegrated.	[113]
MG-RAST	MG-RAST is a web-based, fully automated system for metagenomic analysis. It provides phylogenetic and functional analysis. Pros: Requires only reads of 75 bp or longer for gene prediction or similarity analysis, which provides taxonomic binning and functional classification. Able to handle both assembled and unassembled data. Cons: MG-RAST has been optimized for use with the Firefox browser. There are some browser-to-browser issues with visualization of certain diagrams.	[114]
IMG/M	IMG/M is a web-based pipeline that provides comparative analysis of metagenomes. It provides structural and functional annotation and prefers assembled contigs. Pros: Integrates all datasets into a single protein-level abstraction. In contrast to MG-RAST, IMG/M includes more computationally expensive tools such as hidden Markov models and BLASTX. Cons: The statistical analysis tool is available only as an on-demand computation to registered IMG users of the Expert Review IMG site.	[115]
METAREP	METAREP is a suite of web-based tools to view and compare metagenomic annotated data including both functional and taxonomical assignments. Pros: Able to handle extremely large datasets. Able to perform comparison on up to 20+ datasets simultaneously. Cons: No inbuilt annotation workflow. Users need to upload existing annotations.	[116]
CuffDiff	Cufflinks is a suite of programs that assembles transcriptomes, estimates abundance, and tests for differential gene expression. It implements a parsimony-based algorithm. Pros: High efficiency, sensitivity and precision. Cons: Not optimized for metatranscriptomics analysis.	[123]
Blast2GO	Blast2GO is BLAST-based software that provides automatic functional annotation of DNA/protein sequences. It has multiple annotation styles that can be used for various conditions. Pros: Combines multiple annotation strategies. Strong visualization tools. Cons: Not optimized for large datasets with a large number of genes.	[124]
Viromic sequencing data analysis
VICUNA	VICUNA is a de novo assembler targeting viral populations, which have high mutation rates. Its algorithm uses an overlap-layout-consensus based approach. The general process of VICUNA is trimming reads, constructing/clustering contigs, validating contigs, and then extending and merging contigs. Pros: Able to efficiently process ultra-deep sequence data. High accuracy and continuity. Cons: Limited accessibility due to its requirement of local computing power.	[135]
Metavir/ Metavir2	Metavir is a web-based pipeline specifically for viral metagenome analysis. Metavir2 builds on Metavir with additional features such as new tools for assembled virome sequence analysis and new dataset comparison strategies. Pros: User-friendly interface. Able to perform analysis on both raw reads and assembled virome sequences. Cons: Focuses on compositional analysis. Functional annotation is lacking.	[136,137]
VMGAP	VMGAP is an automated pipeline for functional annotation of viral shotgun metagenomic data. It first performs database searches and then functional assignments. Pros: Uses specialized databases. Cons: Requires local installation of several open-source packages, programs and public databases.	[138]
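
The Hc-OTU entry in Table 1 mentions homopolymer compaction and k-mer profiling. As a rough illustration of those two ideas only, and not of the Hc-OTU implementation itself, the Python sketch below collapses runs of identical bases and then compares two sequences by the fraction of k-mers they share. The function names, the default k = 3, and this particular distance definition are assumptions made for the example.

```python
from collections import Counter
from itertools import groupby


def compact_homopolymers(seq: str) -> str:
    """Collapse each run of identical bases to a single base (AAACC -> AC)."""
    return "".join(base for base, _run in groupby(seq))


def kmer_profile(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Illustrative k-mer distance: 1 - (shared k-mers / k-mers in the shorter sequence).

    This definition is an assumption for the sketch; real tools may use
    a different formula and a different k.
    """
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        return 1.0  # sequences too short to contain any k-mer
    shared = sum((kmer_profile(a, k) & kmer_profile(b, k)).values())
    return 1.0 - shared / denom


if __name__ == "__main__":
    read = "AACGTTACGGT"        # hypothetical read with homopolymer-length noise
    reference = "ACGTACGT"      # hypothetical reference sequence
    compacted = compact_homopolymers(read)
    print(compacted)                            # ACGTACGT
    print(kmer_distance(compacted, reference))  # 0.0
```

In this toy case the read differs from the reference only in homopolymer lengths, so compaction makes the two k-mer profiles identical. A clustering method could then compare such distances against a user-specified threshold, as the table notes for Hc-OTU.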