TABLE 2.
Platform | RAM (GB) | Runtime (h) | Speedup | GPU speedup | No. of chunks |
---|---|---|---|---|---|
Original CPU Xeon Gold 6242 | 5.5 | 498 | 1× | | 36 |
CPU Mobile i7-8850H | Not collected | 10 | 50× | | 12 |
CPU Xeon Gold 6242 | 148 | 3 | 166× | 1× | 1 |
GPU Mobile GTX 1050 Max-Q | 3.6 | 3 | 166× | 1× | 36 |
GPU T4 | 38 | 0.68 | 730× | 4.4× | 4 |
GPU RTX2080TI | 27 | 0.32 | 1,560× | 9.4× | 6 |
GPU V100 PCIE 32GB | 75 | 0.22 | 2,260× | 13.6× | 2 |
GPU RTX3090 | 51 | 0.19 | 2,600× | 15.8× | 3 |
Speedup is relative to performance on the same data using Striped UniFrac from McDonald et al. (10). In all cases, all available compute resources for an architecture were utilized. Peak resident memory for the runs is provided; however, the peak memory used during processing depends on how many chunks are processed at one time. The largest memory use comes from creating the distance matrix, which scales as N² with the number of samples N (not shown) and is effectively invariant to the architecture.
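
As a rough check on how the two speedup columns relate to the runtimes, the following sketch recomputes them from the Runtime (h) column. It assumes that "Speedup" is the 498 h original CPU runtime divided by each runtime, and that "GPU speedup" uses the 3 h GTX 1050 Max-Q run as its 1× reference; the latter is inferred from the table, not stated in the caption. The names `speedup`, `gpu_runtimes_h`, and `speedups` are illustrative only.

```python
# Runtimes in hours, copied from Table 2.
ORIGINAL_CPU_H = 498.0   # original Striped UniFrac, CPU Xeon Gold 6242
GPU_REFERENCE_H = 3.0    # GPU Mobile GTX 1050 Max-Q (assumed 1x reference)

gpu_runtimes_h = {
    "GPU T4": 0.68,
    "GPU RTX2080TI": 0.32,
    "GPU V100 PCIE 32GB": 0.22,
    "GPU RTX3090": 0.19,
}


def speedup(reference_h: float, runtime_h: float) -> float:
    """Return how many times faster runtime_h is than reference_h."""
    return reference_h / runtime_h


# Map each platform to (speedup vs. original CPU run, speedup vs. GPU reference).
speedups = {
    name: (speedup(ORIGINAL_CPU_H, h), speedup(GPU_REFERENCE_H, h))
    for name, h in gpu_runtimes_h.items()
}

for name, (overall, vs_gpu) in speedups.items():
    print(f"{name}: {overall:.0f}x overall, {vs_gpu:.1f}x vs. GPU reference")
```

Under these assumptions the computed ratios agree with the tabulated values to within about 1%, which suggests the table entries are rounded.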