Table 2.
Computational frameworks designed to detect microbiota from human sequences by subtractive, filtration, or mixed methods
Framework | Approach | Dependencies | Input/output | Advantages/disadvantages | Cancer validation | Refs. |
---|---|---|---|---|---|---|
PathSeq | Alignment and de novo assembly | BLAST, BLASTN, BLASTX, MAQ, MegaBLAST, RepeatMasker, Velvet | Input: RNA-seq or DNA-seq; Output: Pathogen presence/absence | Scalable cloud computing; Feasible for known and novel pathogen identification; Two-pass subtraction with increased filtering costs | Cervical cancer (cell line and simulated data); TCGA ovarian | [63, 68] |
SRSA | Alignment and de novo assembly | Velvet, MegaBLAST, BLAST, BWA, TopHat | Input: RNA-seq; Output: Species-level taxonomy characterization (prevalence) | Incorporates sample pre-processing, quality filtering, sequence mapping, and assembly; Not freely available; No known updates; Original validation was limited to a cell line | HIV-1 cell line | [60] |
CaPSID | Mixed-method: simultaneous alignment, filtration, and de novo assembly | BioPython, Bowtie2, Trinity | Input: RNA-seq or DNA-seq; Output: Top-hit pathogen genome identification ranked by maximum gene coverage | Web-based, open-source, and scalable application; Modular analyses; Single-pass filtering, which may fail to subtract host reads | Ovarian cancer; TCGA stomach | [67] |
SURPI | Dual scanning mode: known-pathogen identification or de novo assembly | SNAP, RAPSearch, BWA, BLASTN, Bowtie2, DUST in PRINSEQ | Input: Paired-end metagenomic; Output: Species-level taxonomic classification and coverage map | Scalable to cloud or standalone servers; Capacity to incorporate reference database; Dual-mode: quantitative and semi-quantitative pathogen identification | Prostate cancer (cell line, tissue biopsies); Colorectal cancer (tissue biopsies) | [71] |
PathoScope 2.0 | Penalized probabilistic identification; Modular filtration, alignment, and assignment | SAMtools, BLASTX, Bowtie2, thetaPrior | Input: Metagenomic or genomic (RNA-seq or DNA-seq); Output: Strain-level pathogen relative abundance | Modular, detailed result reporting; Designed for low-abundance strain-level identification; MySQL server required; No connection to the population structure of relevant species | TCGA stomach | [69, 70] |
VirusScan | Identification of known viruses and integration sites | BWA, BLAST, MegaBLAST, Pindel, RepeatMasker, PHYLIP | Input: RNA-seq; Output: Viral read abundance and integration sites | Designed for viral identification; Abundance and integration-site analyses | TCGA cancer cohorts | [72] |
MetaShot | Two-step similarity filtering and taxonomic assessment | Bowtie2, TANGO, STAR, Bash | Input: RNA-seq or DNA-seq; Output: Assigned read report and Krona plot with relative abundance | Extracts unassigned reads; Allows for functional annotations; Slower than other applications | None | [73] |
ConStrains | Marker-based (SNP patterns) strain-level prediction | MetaPhlAn, PhyloPhlAn, Bowtie2, SAMtools, Metropolis-Hastings Monte Carlo | Input: Metagenomics (RNA-seq); Output: Strain-level prediction and relative abundance | Single reference strain collection; Facilitates functional analyses when combined with reference genome-based gene coverage metadata | None | [74] |
RINS | Intersection-based identification and removal | Bowtie, BLAST, BLAT, Trinity | Input: Mate-paired RNA-seq unmapped reads; Output: Pathogen contigs | Requires prior knowledge of reference; Detection limited to user-defined parameters | Prostate cancer (cell line) | [66] |
GRAMMy | Mixture-model Bayesian, Expectation–Maximization, and maximum likelihood estimation | BLAST, BLAT, MAQ, Bowtie, PerM, BLASY | Input: Metagenomics reads; Output: Genomic relative abundance as numerical vectors | User flexibility; Probabilistic handling of ambiguous hits; Computational efficiency | None | [76] |
Comparison of computational workflows designed to derive microbial content from human sequences by subtractive and filtering methods, broadly categorized as reference-based, reference-free, and mixed-methods approaches. The data required to run each pipeline, its output, and its advantages and disadvantages are summarized. Most have been validated with large cancer datasets, including TCGA sequencing data. ConStrains is reference-free, while all other approaches are reference-based or mixed-methods.
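
The subtractive idea shared by these workflows (discard reads that match the human host, then assign the remaining reads to microbial references) can be illustrated with a minimal sketch. This is a toy stand-in that uses exact k-mer matching in place of a real aligner; it does not reproduce any framework in the table. All function names (`subtract_host`, `assign_reads`), the thresholds, and the sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_host(reads, host_reference, k=8, max_host_fraction=0.5):
    """Subtraction step: keep reads whose share of host k-mers is small.

    Hypothetical helper; real pipelines align reads to the host genome.
    """
    host_index = kmers(host_reference, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k cannot be assessed; drop it
        host_fraction = len(read_kmers & host_index) / len(read_kmers)
        if host_fraction <= max_host_fraction:
            kept.append(read)
    return kept


def assign_reads(reads, microbial_references, k=8):
    """Assignment step: count reads per reference sharing the most k-mers."""
    indexes = {name: kmers(seq, k) for name, seq in microbial_references.items()}
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_shared = "unassigned", 0
        for name, index in indexes.items():
            shared = len(read_kmers & index)
            if shared > best_shared:
                best_name, best_shared = name, shared
        counts[best_name] += 1
    return counts


# Toy sequences (invented for this example, not real genomes).
host = "ACGTACGTTAGCCGATAGCTAGGCTTAACGGATCCA"
references = {
    "microbe_a": "TTTTGGGGCCCCAAAATTTTGGGG",
    "microbe_b": "AAAACCCCGGGGTTTTAAAACCCC",
}

# Two host-derived reads and three microbial reads.
reads = [
    host[0:12],
    host[10:24],
    references["microbe_a"][2:14],
    references["microbe_a"][8:20],
    references["microbe_b"][4:16],
]

kept = subtract_host(reads, host)
abundance = assign_reads(kept, references)
print(len(kept), dict(abundance))
```

In this sketch the two host-derived reads are removed in a single pass and the remaining reads are counted per reference. The table's note on single-pass filtering applies here as well: a host read carrying enough mismatches to fall below the threshold would slip through, which is why some frameworks add a second subtraction pass.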