Table 1:
Name | Function | Features | Pros/Cons | Reference |
---|---|---|---|---|
SparkSW | Alignment and mapping | Consists of three phases: data preprocessing, SW as map tasks, and top K records as reduce tasks | Load-balancing, scalable, but without the mapping location and traceback of optimal alignment | [16] |
DSA | Alignment and mapping | Leverages a data-parallel strategy based on SIMD instructions | Up to 201 times faster than SparkSW; speed increases almost linearly with the number of cluster nodes | [17] |
CloudSW | Alignment and mapping | Leverages SIMD instructions and provides API services in the cloud | Up to 3.29 times faster than DSA and 621 times faster than SparkSW; high scalability and efficiency | [18] |
SparkBWA | Alignment and mapping | Consists of three main stages: RDD creation, map, and reduce phases; employs two independent software layers | For shorter reads, averages 1.9x and 1.4x faster than SEAL and pBWA, respectively. For longer reads, averages 1.4x faster than BigBWA and Halvade, but requires the data to be available in HDFS | [19] |
StreamBWA | Alignment and mapping | Input data are streamed into the cluster directly from a compressed file | ∼2x faster than the nonstreaming approach, and 5x faster than SparkBWA | [20] |
PASTASpark | Alignment and mapping | Employs an in-memory RDD of key-value pairs to parallelize the MSA computation phase | Up to 10x faster than single-threaded PASTA; ensures scalability and fault tolerance | [21] |
PPCAS | Alignment and mapping | Based on the MapReduce processing paradigm in Spark | Better with a single node and shows almost linearly increased speeds with increasing numbers of nodes | [22] |
SparkBLAST | Alignment and mapping | Utilizes pipe operator and Spark RDDs to call BLAST as an external library | Outperforms CloudBLAST in terms of speed, scalability, and efficiency | [23] |
MetaSpark | Alignment and mapping | Consists of five steps: constructing k-mer RefindexRDD, constructing k-mer ReadlistRDD, seeding, filtering, and banded alignment | Recruits significantly more reads than SOAP2, BWA, and LAST and more reads by ∼4 than FR-HIT; shows good scalability and overall high performance | [24] |
Spaler | Assembly | Employs Spark's GraphX API; consists of two main parts: de Bruijn graph construction and contig generation | Shows better scalability and achieves comparable or better assembly quality than ABySS, Ray, and SWAP-Assembler | [25] |
SA-BR-Spark | Assembly | Uses a strategy of finding the source of reads; based on the Spark platform | Faster than SA-BR-MR | [26] |
HiGene | Sequence analysis | Puts forward a dynamic computing resource scheduler and an efficient way of mitigating data skew | Reduces total running time from days to just under an hour; 2x faster than Halvade | [27] |
GATK-Spark | Sequence analysis | Takes full account of compute, workload, and characteristics | Achieves more than 37 times increased speed | [28] |
SparkSeq | Sequence analysis | Builds and runs genomic analysis pipelines in an interactive way by using Spark | Achieves 8.4–9.15 times faster speeds than SeqPig; accelerates data querying up to 110 times and reduces memory consumption by 13 times | [29] |
CloudPhylo | Phylogeny | Evenly distributes entire workloads between worker nodes | Shows good scalability and high efficiency; the Spark version is better than the Hadoop version | [30] |
S-CHEMO | Drug discovery | Intermediate data are immediately consumed again on the producing nodes, saving time and bandwidth | Shows almost linearly increased speeds on up to eight nodes compared with the original pipeline | [31] |
Falco | Single-cell RNA sequencing | Consists of a splitting step, an optional preprocessing step, and the main analysis step | At least 2.6x faster than a highly optimized single-node analysis; running time decreases with increasing numbers of nodes | [32] |
VariantSpark | Variant association and population genetics studies | Parallelizes population-scale tasks based on Spark and the associated MLlib | 80% faster than ADAM, Hadoop/Mahout version, and ADMIXTURE; more than 90% faster than R and Python implementations | [33] |
SEQSpark | Variant association and population genetics studies | Splits large-scale datasets into many small blocks to perform rare variant association analyses | Always faster than Variant Association Tools and PLINK/SEQ; in some cases, running time is reduced to 1% | [34] |
BioSpark | Data-parallel analysis on large, numerical datasets | Consists of a set of Java, C++, and Python libraries; abstractions for parallel analysis of standard data types; some APIs; and file conversion tools | Convenient, scalable, and useful; has domain-specific features for biological applications | [35] |
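
The three-phase pattern listed for SparkSW in the table (data preprocessing, SW as map tasks, top K records as reduce tasks) can be illustrated with a minimal single-machine sketch. This is not SparkSW's code: the names `sw_score` and `top_k_alignments`, the scoring parameters, and the linear gap penalty are all assumptions chosen for illustration, and Python's built-in `map` and `heapq.nlargest` stand in for Spark's distributed map and reduce operations. In line with the limitation noted in the table, the sketch returns only scores, with no mapping location or traceback.

```python
import heapq


def sw_score(query, subject, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    Scoring values are illustrative assumptions. Only the score is
    computed; there is no traceback of the optimal alignment.
    """
    prev = [0] * (len(subject) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always 0 for local alignment
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def top_k_alignments(query, database, k=2):
    """Hypothetical helper mirroring the three phases named in the table."""
    query = query.upper()
    # Phase 1: preprocessing (normalize case, drop empty records).
    records = [(name, seq.upper()) for name, seq in database if seq]
    # Phase 2: "map" step, one independent SW task per database record.
    scored = map(lambda rec: (sw_score(query, rec[1]), rec[0]), records)
    # Phase 3: "reduce" step, keep only the k highest-scoring records.
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    db = [("a", "ACGT"), ("b", "TTTT"), ("c", "ACGA")]
    for score, name in top_k_alignments("ACGT", db, k=2):
        print(name, score)
```

Because each record is scored independently, the map step has no shared state, which is what makes this kind of workload straightforward to distribute across worker nodes.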