Table 1. Description of BPGA Pipeline.
Feature | Description | Tools/scripts | Notes | Equivalent tools | Citation |
---|---|---|---|---|---|
Preparation step | Preprocesses raw files (.faa, .fsa, any other FASTA file, or .gbk) into the single input file required for clustering. | BPGA script | BPGA modifies the files by inserting the genome ID into the sequence headers. | NA | This study |
Clustering | Clusters genes into orthologous clusters based on sequence similarity. | USEARCH#, CD-HIT*, OrthoMCL* | USEARCH is the fastest clustering tool so far. BPGA uses it as the default clustering tool and can also process clusters produced by the other two. | Roary, PGAP, PGAT, ITEP, Panseq | [25, 27, 28, 29, 45] |
Matrix Generation (Pan-Matrix) | Generates a binary (1/0) presence/absence matrix from the orthologous clusters. | BPGA script | The BPGA script checks the presence or absence of each gene in the individual strains and writes the result as a matrix. | Roary, PanGP, PGAP | [26, 27, 28] |
Pan-Genome Profile Analysis | Calculates the number of shared genes after stepwise addition of each individual genome. This trend can be plotted as core- or pan-genome profile curves. | BPGA script, gnuplot | The BPGA script calculates these trends over different permutations/combinations of genomes. | Roary, PanGP, PGAP | [26, 27, 28] |
Phylogeny Construction | Pan phylogeny: generates a phylogenetic tree based on pan-matrix data. Core/MLST phylogeny: generates a phylogenetic tree based on concatenated core/housekeeping gene alignments. | BPGA script, MUSCLE#, Librsvg | The BPGA script concatenates the core sequences from all strains and converts the pan-matrix into a Newick tree. MUSCLE is a fast and accurate alignment and tree-generation tool. | Roary, PGAP, Panseq, ITEP | [25, 27, 28, 29] |
Function and Pathway† Analysis | COG and KEGG assignments based on best hits against the respective reference databases. | USEARCH#, BPGA script, gnuplot | Best hits are processed to obtain the percentage occurrence of every COG and KEGG pathway category. | COG: PGAP, PGAT, ITEP. KEGG analysis: None | [28, 29, 45] |
Pan-Genome Statistics† | Provides genome-wise counts of core, accessory, unique and exclusively absent genes. | BPGA script | Indicates the contribution of each strain to the pan-genome. | None | This study |
Atypical GC Content Analysis† | Identifies genes whose GC content is substantially higher or lower than the genomic GC content. | BPGA script | Applicable only if GenBank files are used as input. | None | This study |
Subset Analysis† | Divides the original dataset into smaller user-defined subsets and performs the default pan-genomic analyses on each. | BPGA script | The subsets may be based on pathogenic potential, habitat, taxonomic group or any other criterion. | None | This study |
Exclusive gene absence† | Identifies clusters in which a gene is absent exclusively from one specific strain. | BPGA script | Sequences of such clusters are written to the output file. | None | This study |
#Automated by BPGA script.
*Supported outputs.
†Novel features of BPGA. NA: Not applicable.
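
The matrix generation, pan-genome statistics and profile analysis rows of Table 1 can be illustrated with a minimal Python sketch. It is not BPGA's own code: the function names (`build_pan_matrix`, `genome_stats`, `pan_core_profile`), the toy clusters, and the choices to average over randomly sampled genome orders and to count as "accessory" any gene present in more than one but not all genomes are assumptions made here for illustration only.

```python
import random
from statistics import mean


def build_pan_matrix(clusters, genomes):
    """Map each cluster ID to a 1/0 presence/absence row, one column per genome."""
    return {cid: [1 if g in members else 0 for g in genomes]
            for cid, members in clusters.items()}


def genome_stats(matrix, genomes):
    """Count core, accessory, unique and exclusively absent genes per genome."""
    n = len(genomes)
    stats = {g: {"core": 0, "accessory": 0, "unique": 0, "exclusively_absent": 0}
             for g in genomes}
    for row in matrix.values():
        total = sum(row)
        for i, g in enumerate(genomes):
            if row[i]:
                if total == n:
                    stats[g]["core"] += 1
                elif total == 1:
                    stats[g]["unique"] += 1
                else:
                    stats[g]["accessory"] += 1  # assumed definition
            elif total == n - 1:
                # Every other genome has the gene; only this one lacks it.
                stats[g]["exclusively_absent"] += 1
    return stats


def pan_core_profile(matrix, genomes, n_orders=20, seed=0):
    """Mean pan- and core-genome size after adding 1..N genomes in random orders."""
    rng = random.Random(seed)
    n = len(genomes)
    rows = list(matrix.values())
    pan_sizes = [[] for _ in range(n)]
    core_sizes = [[] for _ in range(n)]
    for _ in range(n_orders):
        order = rng.sample(range(n), n)  # one random permutation of genome indices
        for k in range(1, n + 1):
            chosen = order[:k]
            pan_sizes[k - 1].append(sum(1 for r in rows if any(r[i] for i in chosen)))
            core_sizes[k - 1].append(sum(1 for r in rows if all(r[i] for i in chosen)))
    return [mean(v) for v in pan_sizes], [mean(v) for v in core_sizes]


# Hypothetical toy input: cluster ID -> set of genomes contributing a member gene.
genomes = ["A", "B", "C"]
clusters = {"c1": {"A", "B", "C"}, "c2": {"A", "B"}, "c3": {"C"}, "c4": {"B", "C"}}

matrix = build_pan_matrix(clusters, genomes)
stats = genome_stats(matrix, genomes)
pan, core = pan_core_profile(matrix, genomes)

for cid, row in matrix.items():
    print(cid, row)
print(stats)
print("pan profile:", pan)
print("core profile:", core)
```

In this sketch the pan-genome curve can only grow and the core-genome curve can only shrink as genomes are added, which is the trend the profile curves in the table describe. Real inputs would come from the clustering output rather than a hand-written dictionary.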