Table 2.
Kernel | Performance limitation | Global memory bandwidth, GB/s | Registers per thread | Shared memory per block, bytes | Theoretical occupancy, % | Achieved occupancy, % |
---|---|---|---|---|---|---|
Naive | Memory bandwidth | 207 | 48 | 0 | 62.5 | 62.1 |
Tile = 2 | Memory bandwidth | 201 | 72 | 4096 | 43.8 | 43.3 |
Tile = 4 | Memory bandwidth | 193 | 73 | 8192 | 37.5 | 37.1 |
Tile = 6 | Instruction and memory latency | 154 | 104 | 12 288 | 25 | 24.9 |
Tile = 8 | Instruction and memory latency | 101 | 142 | 16 384 | 18.8 | 18.2 |
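
The theoretical occupancy column can be related to the registers-per-thread column with a short calculation. The sketch below is illustrative only: the table does not name the GPU, so the hardware limits used here (65,536 registers and 64 resident warps per multiprocessor, registers allocated per warp in units of 256, warps allocated in groups of 4) are assumptions, and the function name `theoretical_occupancy` is hypothetical. It also assumes that registers are the only limiting resource and ignores shared memory and block-count limits.

```python
def theoretical_occupancy(regs_per_thread,
                          regs_per_sm=65536,         # assumed register file size per SM
                          max_warps_per_sm=64,       # assumed resident-warp limit per SM
                          reg_alloc_unit=256,        # assumed per-warp allocation unit
                          warp_alloc_granularity=4,  # assumed warp allocation granularity
                          warp_size=32):
    """Register-limited theoretical occupancy in percent (a sketch, not a profiler)."""
    # Registers needed by one warp, rounded up to the allocation unit.
    regs_per_warp = regs_per_thread * warp_size
    regs_per_warp = -(-regs_per_warp // reg_alloc_unit) * reg_alloc_unit

    # Warps that fit in the register file, rounded down to the warp granularity.
    warps = regs_per_sm // regs_per_warp
    warps -= warps % warp_alloc_granularity
    warps = min(warps, max_warps_per_sm)

    return 100.0 * warps / max_warps_per_sm


# Registers per thread taken from Table 2.
table_regs = {"Naive": 48, "Tile = 2": 72, "Tile = 4": 73,
              "Tile = 6": 104, "Tile = 8": 142}

for kernel, regs in table_regs.items():
    print(f"{kernel}: {theoretical_occupancy(regs):.2f} %")
```

Under these assumed limits the calculation yields 62.5, 43.75, 37.5, 25 and 18.75 %, which agree with the tabulated theoretical occupancy after rounding. This is consistent with register pressure being the factor that lowers occupancy as the tile size grows, although the table alone does not confirm which hardware limits applied.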