. 2020 Dec 3;21(Suppl 9):523. doi: 10.1186/s12859-020-03831-9

Table 2.

Computational frameworks designed to detect microbiota from human sequences by subtractive, filtration, or mixed methods

Framework	Approach	Dependencies	Input | output	Advantages/disadvantages	Cancer validation	Refs.
PathSeq	Alignment and de novo assembly	BLAST BLASTN BLASTX MAQ MegaBLAST RepeatMasker Velvet	Input: RNA-seq or DNA-seq Output: Pathogen presence/absence	Scalable cloud computing; Feasible for known and novel pathogen identification; Two-pass subtraction with increased filtering costs	Cervical cancer (cell line and simulated data); TCGA ovarian	[63, 68]
SRSA	Alignment and de novo assembly	Velvet MegaBLAST BLAST BWA TopHat	Input: RNA-seq Output: Species-level taxonomy characterization (prevalence)	Incorporates sample pre-processing, quality filtering, sequence mapping, and assembly; Not freely available; No known updates; Original validation was limited to a cell line	HIV-1 cell line	[60]
CaPSID	Mixed-method; simultaneous alignment, filtration and de novo assembly	BioPython Bowtie2 Trinity	Input: RNA-seq or DNA-seq Output: Top-hit pathogen genome identification ranked by maximum gene coverage	Web-based, open-source and scalable application; Modular analyses; Single-pass filtering, which may fail to subtract host reads	Ovarian cancer; TCGA stomach	[67]
SURPI	Dual scanning mode; Known-pathogen identification or de novo assembly	SNAP RAPSearch BWA BLASTN Bowtie2 DUST in PRINSEQ	Input: Paired-end metagenomic Output: Species-level taxonomic classification and coverage map	Scalable to cloud or standalone servers; Capacity to incorporate reference database; Dual-mode: quantitative and semi-quantitative pathogen identification	Prostate cancer (cell line, tissue biopsies); Colorectal cancer (tissue biopsies)	[71]
PathoScope 2.0	Penalized probabilistic identification; Modular filtration, alignment and assignment	SAMtools BLASTX Bowtie2 thetaPrior	Input: Metagenomic or genomic (RNA-seq or DNA-seq) Output: Strain-level pathogen relative abundance	Modular, detailed result reporting; Designed for low-abundance strain-level identification; MySQL server required; No connection to the population structure of relevant species	TCGA stomach	[69, 70]
VirusScan	Identification of known viruses and viral integration sites	BWA BLAST MegaBLAST Pindel RepeatMasker PHYLIP	Input: RNA-seq Output: Viral read abundance and integration sites	Designed for viral identification; Abundance and integration-site analyses	TCGA cancer cohorts	[72]
MetaShot	Two-step similarity filtering and taxonomic assessment	Bowtie2 TANGO STAR Bash	Input: RNA-Seq or DNA-Seq Output: Assigned read report and Krona plot with relative abundance	Extracts unassigned reads; Allows for functional annotations; Slower than other applications	None	[73]
ConStrains	Marker-based (SNP patterns); Strain-level prediction	MetaPhlAn PhyloPhlAn Bowtie2 SAMtools Metropolis-Hastings Monte Carlo	Input: Metagenomics (RNA-seq) Output: Strain-level prediction and relative abundance	Single reference strain collection; Facilitates functional analyses when combined with reference genome-based gene coverage metadata	None	[74]
RINS	Intersection-based identification and removal	Bowtie BLAST BLAT Trinity	Input: Mate-paired RNA-seq unmapped reads Output: Pathogen contigs	Requires prior knowledge of reference; Detection limited to user-defined parameters	Prostate cancer (cell line)	[66]
GRAMMy	Mixture-model Bayesian, Expectation–Maximization and maximum likelihood estimation	BLAST BLAT MAQ Bowtie PerM BLASY	Input: Metagenomics reads Output: Genomic relative abundance as numerical vectors	User flexibility; Probabilistic handling of ambiguous hits; Computational efficiency	None	[76]

Comparison of computational workflows designed to derive microbial content from human sequences by subtractive and filtering methods, broadly categorized as reference-based, reference-free, and mixed-methods approaches. For each workflow, the table summarizes the data required to run the pipeline, the output produced, and the main advantages and disadvantages. Most have been validated with large cancer datasets, including TCGA sequencing data. ConStrains is reference-free, while all other approaches are reference-based or mixed-methods.
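
As an illustration of the subtractive idea shared by these workflows, the following is a minimal sketch, not the implementation of any framework in the table. It assumes that exact k-mer sharing with a host reference can stand in for the alignment step that the listed tools perform with aligners such as BWA or Bowtie2. The function names (`kmers`, `subtract_host`) and the parameters `k` and `max_host_fraction` are hypothetical and chosen only for this example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_host(reads, host_reference, k=8, max_host_fraction=0.5):
    """Keep reads that share few k-mers with the host reference.

    Assumption: k-mer sharing is a crude proxy for read alignment.
    Reads shorter than k cannot be assessed and are discarded.
    """
    host_index = kmers(host_reference, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # too short to evaluate, so filter it out
        host_fraction = len(read_kmers & host_index) / len(read_kmers)
        if host_fraction <= max_host_fraction:
            kept.append(read)  # candidate non-host (e.g. microbial) read
    return kept


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    host = "ACGTACGTTAGCCGATAGGCTTAACGGATC"
    reads = [host[2:18], "TTTTTTTTGGGGGGGGCCCC", "ACG"]
    print(subtract_host(reads, host))
```

The reads that survive this step would then be passed to a classification or assembly stage. A two-pass design, such as the one noted for PathSeq, would repeat the subtraction with a second, stricter filter at additional computational cost.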