Table 4.
Kernel | Performance limitation | Shared memory bandwidth, GB/s | Registers per thread | Shared memory per block, bytes | Theoretical occupancy, % | Achieved occupancy, % |
---|---|---|---|---|---|---|
Naive | Memory bandwidth | 2213 | 32 | 2124 | 100 | 99.8 |
Tile = 2 | Memory bandwidth | 2005 | 39 | 3660 | 75 | 74.9 |
Warp | Instruction and memory latency | 362 | 32 | 6540 | 10.9 | 10.9 |
Warp* | Compute | 262 | 44 | 2000 | 62.5 | 62.5 |