2018 Aug 7;7(8):giy098. doi: 10.1093/gigascience/giy098

Table 1:

Bioinformatics tools and algorithms based on Apache Spark

Name	Function	Features	Pros/Cons	Reference
SparkSW	Alignment and mapping	Consists of three phases: data preprocessing, SW as map tasks, and top K records as reduce tasks	Load-balancing and scalable, but does not report the mapping location or traceback of the optimal alignment	[16]
DSA	Alignment and mapping	Leverages a data-parallel strategy based on SIMD instructions	Up to 201 times increased speed over SparkSW and almost linearly increased speed with increasing numbers of cluster nodes	[17]
CloudSW	Alignment and mapping	Leverages SIMD instructions and provides API services in the cloud	Up to 3.29 times increased speed over DSA and 621 times increased speed over SparkSW; high scalability and efficiency	[18]
SparkBWA	Alignment and mapping	Consists of three main stages: RDD creation, map, and reduce phases; employs two independent software layers	For shorter reads, averages 1.9x and 1.4x faster than SEAL and pBWA. For longer reads, averages 1.4x faster than BigBWA and Halvade, but requires the data to be available in HDFS	[19]
StreamBWA	Alignment and mapping	Input data are streamed into the cluster directly from a compressed file	∼2x faster than nonstreaming approach, and 5x faster than SparkBWA	[20]
PASTASpark	Alignment and mapping	Employs an in-memory RDD of key-value pairs to parallelize the MSA computation phase	Up to 10x faster than single-threaded PASTA; ensures scalability and fault tolerance	[21]
PPCAS	Alignment and mapping	Based on the MapReduce processing paradigm in Spark	Better with a single node and shows almost linearly increased speeds with increasing numbers of nodes	[22]
SparkBLAST	Alignment and mapping	Utilizes pipe operator and Spark RDDs to call BLAST as an external library	Outperforms CloudBLAST in terms of speed, scalability, and efficiency	[23]
MetaSpark	Alignment and mapping	Consists of five steps: constructing k-mer RefindexRDD, constructing k-mer ReadlistRDD, seeding, filtering, and banded alignment	Recruits significantly more reads than SOAP2, BWA, and LAST and more reads by ∼4 than FR-HIT; shows good scalability and overall high performance	[24]
Spaler	Assembly	Employs Spark's GraphX API; consists of two main parts: de Bruijn graph construction and contig generation	Shows better scalability and achieves comparable or better assembly quality than ABySS, Ray, and SWAP-Assembler	[25]
SA-BR-Spark	Assembly	Uses the strategy of finding the source of reads; based on the Spark platform	Shows faster computational speed than SA-BR-MR	[26]
HiGene	Sequence analysis	Puts forward a dynamic computing resource scheduler and an efficient way of mitigating data skew	Reduces total running time from days to just under an hour; 2x faster than Halvade	[27]
GATK-Spark	Sequence analysis	Takes full account of compute, workload, and characteristics	Achieves more than 37 times increased speed	[28]
SparkSeq	Sequence analysis	Builds and runs genomic analysis pipelines in an interactive way by using Spark	Achieves 8.4–9.15 times faster speeds than SeqPig; accelerates data querying up to 110 times and reduces memory consumption by 13 times	[29]
CloudPhylo	Phylogeny	Evenly distributes entire workloads between worker nodes	Shows good scalability and high efficiency; the Spark version is better than the Hadoop version	[30]
S-CHEMO	Drug discovery	Intermediate data are immediately consumed again on the producing nodes, saving time and bandwidth	Shows almost linearly increased speeds on up to eight nodes compared with the original pipeline	[31]
Falco	Single-cell RNA sequencing	Consists of a splitting step, an optional preprocessing step, and the main analysis step	At least 2.6x faster than a highly optimized single-node analysis; running time decreases with increasing numbers of nodes	[32]
VariantSpark	Variant association and population genetics studies	Parallelizes population-scale tasks based on Spark and the associated MLlib	80% faster than ADAM, Hadoop/Mahout version, and ADMIXTURE; more than 90% faster than R and Python implementations	[33]
SEQSpark	Variant association and population genetics studies	Splits large-scale datasets into many small blocks to perform rare variant association analyses	Always faster than Variant Association Tools and PLINK/SEQ; in some cases, running time is reduced to 1%	[34]
BioSpark	Data-parallel analysis on large, numerical datasets	Consists of a set of Java, C++, and Python libraries; abstractions for parallel analysis of standard data types; some APIs; and file conversion tools	Convenient, scalable, and useful; has domain-specific features for biological applications	[35]
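
The map/reduce pattern listed for SparkSW in the table (Smith-Waterman scoring as map tasks, top K records as the reduce step) can be illustrated with a minimal sketch. This is plain Python standing in for Spark, not SparkSW's actual code: the function names `sw_score` and `top_k_hits`, the linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) are all assumptions made for illustration. Like the tool as summarized above, it returns scores only, with no traceback or mapping location.

```python
import heapq


def sw_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty, no traceback).

    The scoring values are illustrative assumptions, not taken from SparkSW.
    """
    prev = [0] * (len(target) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def top_k_hits(query, database, k=2):
    """Hypothetical driver: score every record (map), keep the k best (reduce)."""
    # Map phase: each (name, sequence) record is scored independently,
    # which is what makes the step easy to distribute across workers.
    scored = map(lambda rec: (sw_score(query, rec[1]), rec[0]), database)
    # Reduce phase: retain only the k highest-scoring records.
    return heapq.nlargest(k, scored)


database = [("seq1", "ACGT"), ("seq2", "TTTT"), ("seq3", "ACGA")]
print(top_k_hits("ACGT", database, k=2))
```

In a Spark setting the `map` call would presumably become a transformation over a distributed collection of database records and the top-K selection an aggregation across partitions; the sketch only shows why the two steps separate cleanly.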