2022 Dec 15;13:1030890. doi: 10.3389/fpls.2022.1030890

Table 2.

List of software used in the de novo assembly pipeline, with their functions.

Name of Software Raw data Data filtration Output
Through Sequencer (Leng et al., 2012) RNA-Seq Raw Reads Short Reads- Reads obtained from the common NGS platforms, including Illumina, SOLiD and 454, are often very short (35-500 bp).
Long Reads- Oxford Nanopore/PacBio sequencers can produce long reads of 5 to 100 kb.
Massively parallel sequencing of millions to billions of reads offers high throughput and scalability, and takes less time.
FastQC (Andrews, 2010) Quality Check (QC) Quality assessment- Evaluates the raw read quality, identifies adaptor contamination, and identifies low-quality samples Good/Bad- According to the Phred quality score (Q-value)
Trim Galore Read clean-up (if contaminated) Trimming- Removes the bad bases (adaptor sequences and low-quality bases) at the start and end of the reads
Filtering- Removes contaminants, low-complexity reads (repeats), and reads shorter than 20 bases
K-mer- Nucleotide subsequences of length k, shorter than the read length
De Bruijn graph- Used by several transcriptome assembly programs. Every path in the graph denotes a potential transcript for transcriptome assembly.
De novo assembly (Zhao et al., 2011) Trinity (Henschel et al., 2012) Quality/Phred quality score (Q-score)- Predicts the probability (P) of an error in base calling.
$Q_{\mathrm{phred}} = -10 \log_{10} P$
or, equivalently, $P = 10^{-Q_{\mathrm{phred}}/10}$
De novo Assembler- A novel method for the efficient and robust de novo reconstruction of transcriptomes.
Software modules- Inchworm, Chrysalis, and Butterfly
RSEM (Li and Dewey, 2011) Transcript Abundance Estimation Assembly Statistics N50 length is defined as the length of the shortest sequence in the set of longest sequences that together cover 50% of the total transcriptome assembly length
CD-HIT (Suzek et al., 2007) Transcript clustering Generate unigenes Group of transcript sequences
edgeR (Robinson et al., 2010) Differential Expression Analysis edgeR can be applied to differential expression at the gene, exon, transcript, or tag level. In fact, read counts can be calculated for any genomic feature. There are two testing methods: likelihood ratio tests and quasi-likelihood F-tests. The package documentation describes its key capabilities and then gives several fully worked case studies, from counts to lists of genes
TransDecoder (Wang et al., 2009) Coding DNA Prediction CDS prediction from unigenes Segments of a gene's mRNA that code for protein.
Blast2GO (Martin et al., 2004) Gene Ontology (GO) Mapping and annotation Describes a gene/gene product in detail under three main categories: molecular function (MF), biological process (BP), and cellular component (CC)
Trinotate (Wang et al., 2009) Functional Annotation (COG) Clusters of Orthologous Groups (to predict the function of individual proteins), Venn diagram (to identify genes common to all software), Pfam domain (to identify protein families), Volcano plot (gene expression), Scatter plot (for normalization of obtained values), Heatmap (for highly significant differentially expressed genes) Portion identification and Gene prediction (process of collecting information about and describing a gene)
KAAS (Moriya et al., 2007) Pathway Prediction Pathway analysis against KEGG databases Identification of biological functions
DESeq2 (Love et al., 2017) Differential Expression Analysis Normalization, differential analysis and visualization of high-dimensional count data Count matrices can be collapsed using collapseReplicates, which combines counts from technical replicates into single columns.
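
The Phred relation listed in the table (Q = -10 log10 P, or P = 10^(-Q/10)) can be checked numerically. The following is a minimal Python sketch, not part of any tool in the table. The function names are illustrative, and the quality-string decoder assumes the Phred+33 ASCII offset used by Sanger-style FASTQ files; older data sets may use a different offset.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into per-base Phred scores.

    Assumption: Phred+33 encoding (offset=33); pass offset=64 for
    older Illumina files that use Phred+64.
    """
    return [ord(char) - offset for char in quality_string]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4g}")
```

Under this relation, a score of Q20 corresponds to a 1 in 100 chance of a wrong base call and Q30 to 1 in 1000, which is why trimming thresholds are usually stated as Q-values.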
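
The N50 statistic mentioned in the table can be computed directly from a list of transcript (contig) lengths. The sketch below assumes the common definition: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the total assembly length. The function name `n50` is illustrative, and assembly tools may differ in minor details such as tie handling.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; N50 is the length of
    the sequence at which the cumulative length first reaches at least
    half of the total length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    sorted_lengths = sorted(lengths, reverse=True)
    half_total = sum(sorted_lengths) / 2
    running_total = 0
    for length in sorted_lengths:
        running_total += length
        if running_total >= half_total:
            return length


if __name__ == "__main__":
    # Total length is 54; 10 + 9 + 8 = 27 reaches half of it.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```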
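
The k-mer and De Bruijn graph entries in the table can be illustrated with a toy example. The sketch below is a simplified illustration only and does not reproduce Trinity's Inchworm, Chrysalis, or Butterfly modules: it assumes error-free reads, ignores reverse complements and k-mer abundance, and uses hypothetical function names. Each k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix, so a path through the graph spells out a candidate transcript.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping substrings of length k from a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Build a De Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer adds an edge from its prefix
    (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    toy_reads = ["ATGCG", "GCGTA"]  # hypothetical reads for illustration
    for node, successors in sorted(de_bruijn_graph(toy_reads, 3).items()):
        print(node, "->", sorted(successors))
```

Choosing k is a trade-off: smaller values connect more reads but create more ambiguous branches, while larger values resolve repeats at the cost of needing higher coverage.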