TABLE 1.
Long-read bioinformatics tools

Data analysis step | Tool name | Background and performance | References |
---|---|---|---|
QC metrics | FastQC, MultiQC, LongQC, NanoPack, MinIONQC, NanoR, RNA-SeQC | Quality control (QC) tools suitable for both long- and short-read sequencing. They run QC checks on raw sequence data (FastQC) or on whole datasets (MultiQC) and give detailed feedback on any problems found. For RNA-seq data, a dedicated algorithm (RNA-SeQC) was developed | [47–54] |
Base calling | SMRT analysis tools, Dorado, Guppy | Base callers built on neural networks and statistical methods. SMRT reads require dedicated analysis tools; Dorado and Guppy were developed for NS reads | [55–57] |
Variant calling | Clair3, Sniffles | Sniffles performs structural variant calling on noisy long-read data. Clair3 is a deep neural network-based variant caller that is capable of haplotype-sensitive variant detection and can call variants from sequencing data containing modified bases | [58–60] |
Variant calling | wf-human-variation, wf-somatic-variation | Complex command-line workflows for NS variant detection. Tumor and normal data can be analyzed separately or in combination on demand, and detailed analysis reports are produced | [61] |
Modified base calling | Modbamtools, Guppy, Medaka, DeepSignal, DeepMod | Tools to manipulate and visualize DNA/RNA base modification and methylation data stored in .bam format. Some of them are suitable for all long-read techniques. The detectable modified bases are 5mC, 5hmC and 6mA | [33, 57–59, 62, 63] |
Genome assembly | Flye, Canu, HiCanu, BLASR, FALCON | Tools for de novo genome assembly on high-noise single-molecule sequencing data. Approaches include graph construction (Flye), the hierarchical genome assembly process with clustering (BLASR) and overlap-based error correction; FALCON also carries out phasing | [64–68] |
Visualization | NanoPack, R packages: maftools, ggplot2, Python packages: matplotlib (pyVolcano) | Packages offering universal and problem-specific solutions for long-read data visualization | [50, 69–72] |
Error correction | Pilon, Racon, DeepConsensus, Medaka | Neural network- and transformer-based methods intended as standalone modules that correct raw contigs generated by rapid assembly methods, with or without a consensus step. An advantage of transformer-based error correction methods is that they leverage a unique alignment loss to correct sequencing errors | [33, 35, 71] |

Complex user-friendly interfaces capable of performing the whole analysis process except error correction: PacBio: SMRT Link (Pacific Biosciences); Nanopore: EPI2ME Labs (Oxford Nanopore Technologies).
Additional packages are listed at https://long-read-tools.org and on other bioinformatics-related pages.
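
To make the QC metrics row more concrete, the following minimal Python sketch computes the kind of summary statistics that long-read QC tools typically report: read count, total bases, read-length N50 and mean read quality. It is an illustration only and does not reproduce the implementation or interface of any tool listed in the table. The function names (`read_fastq`, `mean_phred`, `n50`, `summarize`) and the toy records are hypothetical, and the sketch assumes uncompressed four-line FASTQ records with Phred+33 quality encoding.

```python
import io
import math


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean read quality, averaged over error probabilities.

    Averaging probabilities (not raw Phred scores) avoids overstating
    the quality of reads that mix good and poor bases.
    Assumes Phred+33 encoding unless another offset is given.
    """
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize(handle):
    """Collect basic QC statistics for all reads in a FASTQ handle."""
    lengths, quals = [], []
    for _, seq, qual in read_fastq(handle):
        lengths.append(len(seq))
        quals.append(mean_phred(qual))
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "n50": n50(lengths),
        "mean_quality": sum(quals) / len(quals) if quals else 0.0,
    }


# Hypothetical toy input: three short reads with uniform per-base quality.
TOY_FASTQ = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGT\n+\n5555\n"
    "@read3\nACGTAC\n+\n??????\n"
)

print(summarize(io.StringIO(TOY_FASTQ)))
```

For real data, the same `summarize` function could be given an open file handle instead of the in-memory string. The dedicated QC tools in the table add per-position and per-run diagnostics that this sketch does not attempt.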