2020 Nov 27;3:591315. doi: 10.3389/fdata.2020.591315

TABLE 1.

Decomposition of CLUE execution time for 10^4 points per layer with 100 layers. The execution time of each subprocess on the GPU is measured with the NVIDIA profiler, while that on the CPU is measured with std::chrono timers in the C++ code. The uncertainties are the standard deviations over 200 trial runs of the same event (10,000 trial runs for the GPU). The uncertainties of the GPU subprocesses are negligible, since the maximum and minimum kernel execution times reported by the NVIDIA profiler are very close. The speed-up factors of the multi-threaded CPU version with TBB and of the GPU version, both relative to the single-threaded CPU, are given in parentheses. “Mem mgmt + overhead” denotes the time spent handling and copying data, together with the overhead of issuing instructions to the GPU.

| CLUE step | CPU [1T] (baseline) | CPU TBB [10T] | GPU |
|---|---|---|---|
| Build fixed-grid spatial index | 59.3 ± 1.6 ms | 117.7 ± 6.4 ms (0.50x) | 0.28 ms (208.6x) |
| Calculate local density | 218.4 ± 2.5 ms | 33.7 ± 2.6 ms (6.48x) | 0.51 ms (430.6x) |
| Calculate nearest-higher and separation | 326.9 ± 2.9 ms | 45.5 ± 2.5 ms (7.19x) | 0.89 ms (368.5x) |
| Decide seeds/outliers, register followers | 54.4 ± 2.5 ms | 109.4 ± 7.7 ms (0.50x) | 0.34 ms (162.4x) |
| Expand clusters | 17.4 ± 1.5 ms | 6.1 ± 1.3 ms (2.86x) | 0.35 ms (49.7x) |
| Mem mgmt + overhead | 29.1 ± 1.7 ms | 44.9 ± 15.7 ms | 4.27 ms |
| TOTAL (10,000 points per layer) | 705.5 ± 7.9 ms | 357.2 ± 19.7 ms (2.0x) | 6.63 ± 0.63 ms (106.4x) |
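
The measurement procedure described in the caption (std::chrono timers on the CPU, a standard deviation over repeated runs of the same event, and a speed-up factor relative to the single-threaded baseline) could be sketched roughly as below. This is a minimal illustration, not the authors' benchmarking code: the names `time_trials`, `summarize`, `speedup`, and `TimingStats` are hypothetical, and both the choice of `std::chrono::steady_clock` and the sample (n - 1) standard deviation are assumptions, since the source does not state which clock or estimator was used.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Mean and standard deviation of a set of timings, in milliseconds.
// (Hypothetical type, introduced only for this sketch.)
struct TimingStats {
    double mean_ms;
    double stddev_ms;
};

// Runs `step` `trials` times and records the wall-clock duration of each run.
// Assumption: steady_clock is used because it is monotonic; the source only
// says "std::chrono timers".
template <typename Step>
std::vector<double> time_trials(Step&& step, std::size_t trials) {
    std::vector<double> samples_ms;
    samples_ms.reserve(trials);
    for (std::size_t i = 0; i < trials; ++i) {
        const auto start = std::chrono::steady_clock::now();
        step();
        const auto stop = std::chrono::steady_clock::now();
        samples_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return samples_ms;
}

// Reduces per-trial timings to "mean ± standard deviation".
// Assumption: sample standard deviation with an (n - 1) denominator.
inline TimingStats summarize(const std::vector<double>& samples_ms) {
    if (samples_ms.empty()) {
        throw std::invalid_argument("summarize: no samples");
    }
    const double n = static_cast<double>(samples_ms.size());
    double sum = 0.0;
    for (double s : samples_ms) sum += s;
    const double mean = sum / n;

    double sq_dev = 0.0;
    for (double s : samples_ms) sq_dev += (s - mean) * (s - mean);
    const double stddev = (samples_ms.size() > 1) ? std::sqrt(sq_dev / (n - 1.0)) : 0.0;
    return TimingStats{mean, stddev};
}

// Speed-up of a candidate implementation relative to the baseline:
// values above 1 mean the candidate is faster, below 1 mean it is slower.
inline double speedup(double baseline_ms, double candidate_ms) {
    if (candidate_ms <= 0.0) {
        throw std::invalid_argument("speedup: candidate time must be positive");
    }
    return baseline_ms / candidate_ms;
}
```

As a usage example, `summarize(time_trials(run_step, 200))` would yield the mean and spread for one CPU step over 200 runs, where `run_step` stands for any callable wrapping that step. Applying `speedup` to the totals in the table, 705.5 ms against 6.63 ms, gives about 106.4, matching the reported GPU factor. Recomputing per-step factors from the rounded entries can differ slightly from the printed values (for example, 59.3 / 0.28 is about 211.8 rather than 208.6), presumably because the published factors were derived from unrounded timings.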