Table 3.
| Hadoop | | | HPC random | | |
|---|---|---|---|---|---|
| Number of nodes (cores) | Mapping time T_alignment, minutes | | Number of nodes (cores) | Mapping time T_alignment, minutes | |
| 4(28) | 293.5 | 1.71 | 4(64) | 74.4 | 3.89 |
| 6(42) | 189.8 | 1.62 | 10(160) | 32.4 | 3.76 |
| 8(56) | 136.0 | 1.62 | 14(224) | 22.7 | 3.77 |
| 16(112) | 70.3 | 1.48 | 18(288) | 17.9 | 3.78 |
| 32(224) | 39.3 | 1.66 | 22(352) | 14.5 | 3.79 |
| 40(280) | 32.5 | 1.65 | 26(416) | 12.3 | 3.77 |
| | | | 30(480) | 10.7 | 3.73 |
| | | | 34(544) | 9.5 | 3.45 |
| | | | 38(608) | 8.5 | 3.16 |
| | | | 42(672) | 7.6 | 2.96 |
| | | | 46(736) | 7.0 | 2.55 |
| | | | 50(800) | 6.4 | 2.65 |
| | | | 54(864) | 5.9 | 2.34 |
| | | | 58(928) | 5.5 | 2.12 |
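As one way to read the alignment-time columns, the snippet below computes strong-scaling speedup and parallel efficiency from the reported T_alignment values, taking the smallest configuration of each approach as the baseline. This uses the standard textbook definitions (speedup = T_base/T_N, efficiency = speedup scaled by the node ratio) applied to the table, and is not necessarily the metric reported in the unlabeled column.

```python
# Illustrative only: strong-scaling speedup and efficiency derived from the
# Table 3 alignment times, relative to the smallest configuration of each
# approach. Standard definitions, not a metric taken from the paper itself.
hadoop = {4: 293.5, 6: 189.8, 8: 136.0, 16: 70.3, 32: 39.3, 40: 32.5}   # nodes -> minutes
hpc = {4: 74.4, 10: 32.4, 14: 22.7, 18: 17.9, 22: 14.5, 26: 12.3,
       30: 10.7, 34: 9.5, 38: 8.5, 42: 7.6, 46: 7.0, 50: 6.4,
       54: 5.9, 58: 5.5}                                                 # nodes -> minutes

def scaling(times):
    base_nodes = min(times)
    base_time = times[base_nodes]
    for nodes, t in sorted(times.items()):
        speedup = base_time / t
        efficiency = speedup * base_nodes / nodes
        print(f"{nodes:3d} nodes: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")

scaling(hadoop)  # e.g. 40 nodes: ~9.0x speedup on 10x the nodes (efficiency ~0.90)
scaling(hpc)     # e.g. 58 nodes: ~13.5x speedup on 14.5x the nodes (efficiency ~0.93)
```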
For the ‘HPC random’ approach, the data chunks first have to be copied to the local node disks and the resulting alignments (SAM files) copied back, whereas Hadoop keeps all of the data inside HDFS and therefore needs no such data staging. Hadoop does, however, have to ingest the data into HDFS and preprocess the reads before the actual mapping stage so that they can be processed in an MR manner, which results in what we term ‘communication costs’. Note that each HPC node has 16 cores, while each Hadoop node has seven cores available for computation (the eighth core is dedicated to running the virtual machine).
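To make the data-movement difference concrete, the sketch below contrasts the two patterns: in the ‘HPC random’ approach each node stages its read chunk to local disk, aligns it, and copies the SAM output back to shared storage, whereas Hadoop pays its cost up front by ingesting the reads into HDFS (and preprocessing them for MR), after which the map tasks read directly from HDFS. All paths, file names, and the `aligner` command are hypothetical placeholders, not taken from the study; only the overall pattern follows the text above.

```python
import subprocess

# Hypothetical paths and tool names, for illustration only.
SHARED_FASTQ_CHUNK = "/proj/shared/reads/chunk_042.fastq"
LOCAL_SCRATCH = "/scratch/chunk_042.fastq"
LOCAL_SAM = "/scratch/chunk_042.sam"
SHARED_SAM_DIR = "/proj/shared/alignments/"

def hpc_random_node_job():
    """One node's job in the 'HPC random' approach: stage in, align, stage out."""
    # 1. Copy this node's read chunk from shared storage to the local disk.
    subprocess.run(["cp", SHARED_FASTQ_CHUNK, LOCAL_SCRATCH], check=True)
    # 2. Run the aligner locally ('aligner' is a placeholder command line).
    with open(LOCAL_SAM, "w") as out:
        subprocess.run(["aligner", "--ref", "/proj/shared/ref.fa", LOCAL_SCRATCH],
                       stdout=out, check=True)
    # 3. Copy the resulting SAM file back to shared storage.
    subprocess.run(["cp", LOCAL_SAM, SHARED_SAM_DIR], check=True)

def hadoop_ingest_once():
    """The Hadoop approach: ingest the reads into HDFS before the mapping
    stage (the 'communication cost'); map tasks then read from HDFS, so no
    per-node staging to local disks is needed."""
    subprocess.run(["hdfs", "dfs", "-put",
                    "/proj/shared/reads/", "/user/analyst/reads/"], check=True)
```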