2018 Mar 12;14:1176934318760543. doi: 10.1177/1176934318760543

Table 4.

Profiling results of the intratask implementations with synthetic data set 26 (Warp* denotes the improved warp-based implementation).

| Implementation | Performance limitation | Shared memory bandwidth (GB/s) | Registers per thread | Shared memory per block (bytes) | Theoretical occupancy (%) | Achieved occupancy (%) |
|---|---|---|---|---|---|---|
| Naive | Memory bandwidth | 2213 | 32 | 2124 | 100 | 99.8 |
| Tile = 2 | Memory bandwidth | 2005 | 39 | 3660 | 75 | 74.9 |
| Warp | Instruction and memory latency | 362 | 32 | 6540 | 10.9 | 10.9 |
| Warp* | Compute | 262 | 44 | 2000 | 62.5 | 62.5 |