2023 Aug 4;99:106546. doi: 10.1016/j.ultsonch.2023.106546

Table 3.

Hardware features of different Nvidia GPU architectures (full implementation). The notation DP stands for double precision.

architecture                                 Volta    Ampere   Hopper
GPU                                          V100     A100     H100
compute capability                           7.0      8.0      9.0
clock rate (MHz)                             1530     1410     1755
SM count                                     84       128      144
DP units per SM                              32       32       64
DP units per GPU                             2688     4096     9216
peak DP computing power (TFLOPS)             8.2      11.6     32.3
max registers per thread (DP)                127      127      127
max shared memory per SM (kB)                96       164      228
max L2 cache per GPU (MB)                    6        40       50
global memory bandwidth (GB/s)               900      1555     3352
global memory latency (clock cycles)         400      400      400
shared memory bandwidth (GB/s)               16451    23101    32348
shared memory latency (clock cycles)         20       20       20
L2 cache bandwidth (GB/s)                    3133     7219     8986
L2 cache latency (clock cycles)              200      200      200
arithmetic operation latency (clock cycles)  4        4        4
required memory bandwidth (GB/s)             67381    94623    264996
bandwidth ratio (required/global)            75       61       79
bandwidth ratio (required/L2 cache)          22       13       29
bandwidth ratio (required/shared)            4        4        8
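Several rows of the table are derived from the others, so they can be cross-checked. The short Python sketch below is a reconstruction, not the authors' code. It assumes two floating-point operations per DP unit per clock (one fused multiply-add) and a shared memory bandwidth of 128 bytes per clock per SM (32 banks of 4 bytes). It also assumes that the required bandwidth is 8 bytes per peak DP operation multiplied by a factor of 1.024. That last factor is an assumption chosen only because it reproduces the tabulated values; the table does not state how this row was computed. The names `GPUS` and `derived_rows` are hypothetical.

```python
# Hypothetical sketch: recompute the derived rows of Table 3 from the primary
# hardware figures. The formulas are a reconstruction, not the authors' code.

# name: (clock rate in MHz, SM count, DP units per SM,
#        global memory bandwidth in GB/s, L2 cache bandwidth in GB/s)
GPUS = {
    "V100": (1530, 84, 32, 900, 3133),
    "A100": (1410, 128, 32, 1555, 7219),
    "H100": (1755, 144, 64, 3352, 8986),
}

FLOPS_PER_UNIT_PER_CLOCK = 2   # one fused multiply-add counts as two FLOPs
BYTES_PER_DP_OPERAND = 8       # size of one double-precision value
SHARED_BYTES_PER_CLOCK = 128   # assumed: 32 banks x 4 bytes per SM per clock
UNIT_FACTOR = 1.024            # assumed: factor that reproduces the "required" row


def derived_rows(clock_mhz, sm_count, dp_per_sm, global_bw, l2_bw):
    """Return the derived table entries for one GPU (bandwidths in GB/s)."""
    clock_ghz = clock_mhz / 1000.0
    dp_units = sm_count * dp_per_sm
    peak_gflops = FLOPS_PER_UNIT_PER_CLOCK * dp_units * clock_ghz
    shared_bw = SHARED_BYTES_PER_CLOCK * sm_count * clock_ghz
    # Bandwidth needed to deliver one 8-byte operand per peak DP operation,
    # scaled by the assumed UNIT_FACTOR
    required_bw = peak_gflops * BYTES_PER_DP_OPERAND * UNIT_FACTOR
    return {
        "dp_units": dp_units,
        "peak_tflops": round(peak_gflops / 1000.0, 1),
        "shared_bw": round(shared_bw),
        "required_bw": round(required_bw),
        "ratio_global": round(required_bw / global_bw),
        "ratio_l2": round(required_bw / l2_bw),
        "ratio_shared": round(required_bw / shared_bw),
    }


if __name__ == "__main__":
    for name, spec in GPUS.items():
        print(name, derived_rows(*spec))
```

Under these assumptions every computed entry matches the table after rounding; for example, the V100 yields 2688 DP units, 8.2 TFLOPS, 16451 GB/s of shared memory bandwidth and a required/global bandwidth ratio of 75.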