TABLE 1.
Decomposition of CLUE execution time for 10,000 points per layer with 100 layers. Subprocess times on the GPU are measured with the NVIDIA profiler, while those on the CPU are measured with std::chrono timers in the C++ code. The uncertainties are the standard deviations of 200 trial runs of the same event (10,000 trial runs for the GPU). The uncertainties of the GPU subprocesses are negligible, given that the maximum and minimum kernel execution times measured by the NVIDIA profiler are very close. The speed-up factors of the multi-threaded CPU with TBB and of the GPU, relative to the single-threaded CPU, are given in parentheses. “Mem mgmt + overhead” represents the time spent handling and copying data, together with the overhead of issuing instructions to the GPU.
| CLUE step | CPU [1T] (baseline) | CPU TBB [10T] | GPU |
|---|---|---|---|
| Build fixed-grid spatial index | 59.3 ± 1.6 ms | 117.7 ± 6.4 ms (0.50x) | 0.28 ms (208.6x) |
| Calculate local density | 218.4 ± 2.5 ms | 33.7 ± 2.6 ms (6.48x) | 0.51 ms (430.6x) |
| Calculate nearest-higher and separation | 326.9 ± 2.9 ms | 45.5 ± 2.5 ms (7.19x) | 0.89 ms (368.5x) |
| Decide seeds/outliers, register followers | 54.4 ± 2.5 ms | 109.4 ± 7.7 ms (0.50x) | 0.34 ms (162.4x) |
| Expand clusters | 17.4 ± 1.5 ms | 6.1 ± 1.3 ms (2.86x) | 0.35 ms (49.7x) |
| Mem mgmt + overhead | 29.1 ± 1.7 ms | 44.9 ± 15.7 ms | 4.27 ms |
| TOTAL (10,000 points per layer) | 705.5 ± 7.9 ms | 357.2 ± 19.7 ms (2.0x) | 6.63 ± 0.63 ms (106.4x) |
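
The CPU-side measurement described in the caption (std::chrono timers, standard deviation over repeated trials, speed-up relative to the single-threaded baseline) can be sketched as follows. This is only an illustration under stated assumptions, not the code used to produce Table 1: the helper names `time_trials`, `summarize`, and `speedup` are hypothetical, and the choice of `std::chrono::steady_clock` and of the population (rather than sample) standard deviation are assumptions, since the caption specifies neither.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean and standard deviation of a set of timing samples, in milliseconds.
struct TimingStats {
  double mean_ms;
  double stddev_ms;
};

// Hypothetical helper: run `step` n_trials times and record each
// wall-clock duration in milliseconds.
// Assumption: steady_clock is used; the caption only says "std::chrono timers".
template <typename Step>
std::vector<double> time_trials(Step&& step, std::size_t n_trials) {
  std::vector<double> samples_ms;
  samples_ms.reserve(n_trials);
  for (std::size_t i = 0; i < n_trials; ++i) {
    const auto start = std::chrono::steady_clock::now();
    step();
    const auto stop = std::chrono::steady_clock::now();
    samples_ms.push_back(
        std::chrono::duration<double, std::milli>(stop - start).count());
  }
  return samples_ms;
}

// Hypothetical helper: reduce the samples to mean and standard deviation.
// Assumption: population standard deviation (divide by N, not N - 1).
TimingStats summarize(const std::vector<double>& samples_ms) {
  assert(!samples_ms.empty());
  const double n = static_cast<double>(samples_ms.size());
  double sum = 0.0;
  for (double s : samples_ms) sum += s;
  const double mean = sum / n;
  double sum_sq_dev = 0.0;
  for (double s : samples_ms) sum_sq_dev += (s - mean) * (s - mean);
  return {mean, std::sqrt(sum_sq_dev / n)};
}

// Speed-up of a candidate with respect to the baseline:
// values above 1 mean the candidate is faster.
double speedup(double baseline_ms, double candidate_ms) {
  return baseline_ms / candidate_ms;
}
```

For example, `speedup(705.5, 6.63)` gives approximately 106.4, matching the GPU entry in the TOTAL row. Factors recomputed from the rounded per-step values can differ slightly from the tabulated ones, presumably because the latter were derived from unrounded timings.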