2018 Aug 7;7(8):giy098. doi: 10.1093/gigascience/giy098

Table 1: Bioinformatics tools and algorithms based on Apache Spark

| Name | Function | Features | Pros/Cons | Reference |
|---|---|---|---|---|
| SparkSW | Alignment and mapping | Consists of three phases: data preprocessing, SW as map tasks, and top K records as reduce tasks | Load-balanced and scalable, but does not report the mapping location or the traceback of the optimal alignment | [16] |
| DSA | Alignment and mapping | Uses a data-parallel strategy based on SIMD instructions | Up to 201x faster than SparkSW; speedup is almost linear with increasing numbers of cluster nodes | [17] |
| CloudSW | Alignment and mapping | Uses SIMD instructions and provides API services in the cloud | Up to 3.29x faster than DSA and 621x faster than SparkSW; high scalability and efficiency | [18] |
| SparkBWA | Alignment and mapping | Consists of three main stages: RDD creation, map, and reduce phases; employs two independent software layers | For shorter reads, on average 1.9x and 1.4x faster than SEAL and pBWA, respectively; for longer reads, on average 1.4x faster than BigBWA and Halvade; requires the input data to be available in HDFS | [19] |
| StreamBWA | Alignment and mapping | Input data are streamed into the cluster directly from a compressed file | ∼2x faster than the nonstreaming approach and 5x faster than SparkBWA | [20] |
| PASTASpark | Alignment and mapping | Employs an in-memory RDD of key-value pairs to parallelize the MSA calculation phase | Up to 10x faster than single-threaded PASTA; ensures scalability and fault tolerance | [21] |
| PPCAS | Alignment and mapping | Based on the MapReduce processing paradigm in Spark | Better with a single node; speedup is almost linear with increasing numbers of nodes | [22] |
| SparkBLAST | Alignment and mapping | Uses the pipe operator and Spark RDDs to call BLAST as an external library | Outperforms CloudBLAST in terms of speed, scalability, and efficiency | [23] |
| MetaSpark | Alignment and mapping | Consists of five steps: constructing k-mer RefindexRDD, constructing k-mer ReadlistRDD, seeding, filtering, and banded alignment | Recruits significantly more reads than SOAP2, BWA, and LAST and more reads by ∼4 than FR-HIT; shows good scalability and overall high performance | [24] |
| Spaler | Assembly | Employs Spark's GraphX API; consists of two main parts: de Bruijn graph construction and contig generation | Shows better scalability and achieves comparable or better assembly quality than ABySS, Ray, and SWAP-Assembler | [25] |
| SA-BR-Spark | Assembly | Uses a strategy of finding the source of reads; based on the Spark platform | Computationally faster than SA-BR-MR | [26] |
| HiGene | Sequence analysis | Introduces a dynamic computing resource scheduler and an efficient way of mitigating data skew | Reduces total running time from days to just under an hour; 2x faster than Halvade | [27] |
| GATK-Spark | Sequence analysis | Takes full account of compute, workload, and characteristics | Achieves a speedup of more than 37x | [28] |
| SparkSeq | Sequence analysis | Builds and runs genomic analysis pipelines interactively using Spark | 8.4–9.15x faster than SeqPig; accelerates data querying by up to 110x and reduces memory consumption 13-fold | [29] |
| CloudPhylo | Phylogeny | Evenly distributes entire workloads between worker nodes | Shows good scalability and high efficiency; the Spark version is better than the Hadoop version | [30] |
| S-CHEMO | Drug discovery | Intermediate data are immediately consumed again on the producing nodes, saving time and bandwidth | Speedup is almost linear on up to eight nodes compared with the original pipeline | [31] |
| Falco | Single-cell RNA sequencing | Consists of a splitting step, an optional preprocessing step, and the main analysis step | At least 2.6x faster than a highly optimized single-node analysis; running time decreases with increasing numbers of nodes | [32] |
| VariantSpark | Variant association and population genetics studies | Parallelizes population-scale tasks based on Spark and the associated MLlib | 80% faster than ADAM, the Hadoop/Mahout version, and ADMIXTURE; more than 90% faster than R and Python implementations | [33] |
| SEQSpark | Variant association and population genetics studies | Splits large-scale datasets into many small blocks to perform rare variant association analyses | Always faster than Variant Association Tools and PLINK/SEQ; in some cases, running time is reduced to 1% | [34] |
| BioSpark | Data-parallel analysis on large, numerical datasets | Consists of a set of Java, C++, and Python libraries; abstractions for parallel analysis of standard data types; some APIs; and file conversion tools | Convenient, scalable, and useful; has domain-specific features for biological applications | [35] |
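
The SparkSW entry in Table 1 describes a pattern of Smith-Waterman (SW) scoring as map tasks followed by top K selection as a reduce task. The sketch below illustrates that pattern in plain Python, without Spark, so it runs anywhere. It is not the SparkSW implementation: the function names `sw_score` and `top_k_hits`, the scoring parameters (`match`, `mismatch`, `gap`), and the linear gap penalty are all assumptions chosen for illustration. Like the tool as described in the table, it returns scores only, with no mapping location or traceback.

```python
from heapq import nlargest


def sw_score(query, subject, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score (linear gap penalty, no traceback).

    The scoring parameters are illustrative assumptions, not values from the paper.
    """
    prev = [0] * (len(subject) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always 0 for local alignment
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def top_k_hits(query, database, k):
    """Score every (id, sequence) record, then keep the k best (score, id) pairs."""
    # "Map" step: each record is scored independently of the others,
    # which is what makes the work easy to distribute.
    scored = map(lambda rec: (sw_score(query, rec[1]), rec[0]), database)
    # "Reduce" step: keep only the k highest-scoring records.
    return nlargest(k, scored)


if __name__ == "__main__":
    db = [("a", "ACGT"), ("b", "TTTT"), ("c", "ACGA")]
    print(top_k_hits("ACGT", db, k=2))
```

In a Spark setting, the `map` call would correspond to a transformation over a distributed collection of database records and the top K step to an aggregation across partitions. That mapping is a reading of the table's one-line description, not a detail confirmed by the source.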