Table 3.
Hardware features of different Nvidia GPU architectures (full implementation). The notation DP stands for double precision.
| Volta | Ampere | Hopper | |||
|---|---|---|---|---|---|
| V100 | A100 | H100 | |||
| compute capability | 7.0 | 8.0 | 9.0 | ||
| clock rate (MHz) | 1530 | 1410 | 1755 | ||
| SM count | 84 | 128 | 144 | ||
| DP units per SM | 32 | 32 | 64 | ||
| DP units per GPU | 2688 | 4096 | 9216 | ||
| peak DP computing power (TFLOPS) | 8.2 | 11.6 | 32.3 | ||
| max registers per thread (DP) | 127 | 127 | 127 | ||
| max shared memory per SM (kB) | 96 | 164 | 228 | ||
| max L2 cache per GPU (MB) | 6 | 40 | 50 | ||
| global memory bandwidth (GB/s) | 900 | 1555 | 3352 | ||
| global memory latency (clock cycles) | 400 | 400 | 400 | ||
| shared memory bandwidth (GB/s) | 16451 | 23101 | 32348 | ||
| shared memory latency (clock cycles) | 20 | 20 | 20 | ||
| L2 cache bandwidth (GB/s) | 3133 | 7219 | 8986 | ||
| L2 cache latency (clock cycles) | 200 | 200 | 200 | ||
| arithmetic operation latency (cl. cyc.) | 4 | 4 | 4 | ||
| required memory bandwidth (GB/s) | 67381 | 94623 | 264996 | ||
| bandwidth ratio (required/global) | 75 | 61 | 79 | ||
| bandwidth ratio (required/L2 cache) | 22 | 13 | 29 | ||
| bandwidth ratio (required/shared) | 4 | 4 | 8 |