2016 Apr 13;6:24373. doi: 10.1038/srep24373

Table 1. Description of BPGA Pipeline.

| Features | Description | Tools/scripts | Notes | Equivalent tools | Citation |
| --- | --- | --- | --- | --- | --- |
| Preparation step | Preprocesses raw files (.faa, .fsa, any other FASTA, or .gbk) into the single input file required for clustering. | BPGA script | BPGA modifies the files by inserting the genome ID into the sequence headers. | NA | This study |
| Clustering | Clusters genes into orthologous clusters based on sequence similarity. | USEARCH#, CD-HIT*, OrthoMCL* | USEARCH is the fastest clustering tool so far. BPGA uses it as the default clustering tool and can also process clusters produced by the other two. | Roary, PGAP, PGAT, ITEP, Panseq | [25, 27, 28, 29, 45] |
| Matrix Generation (Pan-Matrix) | Generates a binary (1/0) presence/absence matrix from the orthologous clusters. | BPGA script | The BPGA script checks the presence or absence of each gene in the individual strains and writes the result as a matrix. | Roary, PanGP, PGAP | [26, 27, 28] |
| Pan-Genome Profile Analysis | Calculates shared genes after stepwise addition of each individual genome. This trend can be plotted as core or pan-genome profile curves. | BPGA script, gnuplot | The BPGA script calculates these trends over different permutations/combinations of genomes. | Roary, PanGP, PGAP | [26, 27, 28] |
| Phylogeny Construction | Pan phylogeny: generates a phylogenetic tree based on pan-matrix data. Core/MLST phylogeny: generates a phylogenetic tree based on concatenated core/housekeeping gene alignments. | BPGA script, MUSCLE#, Librsvg | The BPGA script concatenates the core sequences from all strains and converts the pan-matrix into a Newick tree. MUSCLE is a faster and more accurate alignment and tree-generation tool. | Roary, PGAP, Panseq, ITEP | [25, 27, 28, 29] |
| Function and Pathway Analysis | COG and KEGG assignments based on best hits against the respective reference databases. | USEARCH#, BPGA script, gnuplot | Best hits are processed to obtain the percentage occurrence of each COG and KEGG pathway category. | COG: PGAP, PGAT, ITEP. KEGG analysis: None | [28, 29, 45] |
| Pan-Genome Statistics | Provides genome-wise counts of core, accessory, unique and exclusively absent genes. | BPGA script | Indicates the contribution of each strain to the pan-genome. | None | This study |
| Atypical GC Content Analysis | Identifies genes whose GC content is substantially higher or lower than the GC content of their genome. | BPGA script | Applicable only if GenBank files are used as input. | None | This study |
| Subset Analysis | Divides the original dataset into user-defined smaller subsets and performs the default pan-genomic analyses on each. | BPGA script | The subsets may be based on pathogenic potential, habitat, taxonomic group or any other criterion. | None | This study |
| Exclusive gene absence | Identifies clusters in which a gene is exclusively absent from one specific strain. | BPGA script | Sequences of such clusters are given in the output file. | None | This study |

#Automated by BPGA script.

*Supported outputs.

These are novel features of BPGA. NA: Not Applicable.
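
The matrix generation, pan-genome statistics, exclusive gene absence and profile analysis rows of Table 1 can be illustrated with a minimal Python sketch. It is not BPGA's own code. The input layout (a dictionary mapping a cluster ID to the set of genomes that contain it), the function names, and the choice of random permutations with a fixed seed are all assumptions made for illustration. "Accessory" is taken here to mean present in more than one genome but not in all of them, and "exclusively absent" to mean missing from exactly one genome.

```python
import random


def build_pan_matrix(clusters, genomes):
    """Binary presence/absence row (1/0 per genome) for every cluster.

    `clusters` is a hypothetical input: {cluster_id: set of genome IDs}.
    """
    return {cid: [1 if g in members else 0 for g in genomes]
            for cid, members in clusters.items()}


def genome_statistics(matrix, genomes):
    """Per-genome counts of core, accessory, unique and exclusively absent genes."""
    n = len(genomes)
    stats = {g: {"core": 0, "accessory": 0, "unique": 0, "exclusively_absent": 0}
             for g in genomes}
    for row in matrix.values():
        total = sum(row)  # number of genomes carrying this cluster
        for i, g in enumerate(genomes):
            if row[i]:
                if total == n:
                    stats[g]["core"] += 1
                elif total == 1:
                    stats[g]["unique"] += 1
                else:
                    stats[g]["accessory"] += 1
            elif total == n - 1:
                # Every other genome has the cluster; only this one lacks it.
                stats[g]["exclusively_absent"] += 1
    return stats


def pan_core_profile(matrix, genomes, n_permutations=20, seed=0):
    """Average (pan, core) sizes after adding 1..N genomes in random orders."""
    rng = random.Random(seed)
    n = len(genomes)
    rows = list(matrix.values())
    pan_sum = [0] * n
    core_sum = [0] * n
    for _ in range(n_permutations):
        order = rng.sample(range(n), n)  # one random genome addition order
        for k in range(1, n + 1):
            chosen = order[:k]
            pan_sum[k - 1] += sum(1 for r in rows if any(r[i] for i in chosen))
            core_sum[k - 1] += sum(1 for r in rows if all(r[i] for i in chosen))
    return [(p / n_permutations, c / n_permutations)
            for p, c in zip(pan_sum, core_sum)]


# Toy data: three genomes and five gene clusters (invented for this example).
genomes = ["A", "B", "C"]
clusters = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B"},
    "c3": {"C"},
    "c4": {"B", "C"},
    "c5": {"A", "B", "C"},
}
matrix = build_pan_matrix(clusters, genomes)
stats = genome_statistics(matrix, genomes)
profile = pan_core_profile(matrix, genomes)
```

In this toy dataset, genome C ends up with two core genes, one accessory gene, one unique gene (c3) and one exclusively absent gene (c2). The last point of the profile is always the full pan-genome size and the full core-genome size, regardless of the permutations drawn.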