Table 3.

Hardware features of different Nvidia GPU architectures (full implementation). The notation DP stands for double precision.

	Volta	Ampere	Hopper
	V100	A100	H100
compute capability	7.0	8.0	9.0
clock rate (MHz)	1530	1410	1755
SM count	84	128	144
DP units per SM	32	32	64
DP units per GPU	2688	4096	9216
peak DP computing power (TFLOPS)	8.2	11.6	32.3
max registers per thread (DP)	127	127	127
max shared memory per SM (kB)	96	164	228
max L2 cache per GPU (MB)	6	40	50
global memory bandwidth (GB/s)	900	1555	3352
global memory latency (clock cycles)	$\approx$ 400	$\approx$ 400	$\approx$ 400
shared memory bandwidth (GB/s)	16451	23101	32348
shared memory latency (clock cycles)	$\approx$ 20	$\approx$ 20	$\approx$ 20
L2 cache bandwidth (GB/s)	3133	7219	$\approx$ 8986
L2 cache latency (clock cycles)	$\approx$ 200	$\approx$ 200	$\approx$ 200
arithmetic operation latency (cl. cyc.)	$\approx$ 4	$\approx$ 4	$\approx$ 4
required memory bandwidth (GB/s)	67381	94623	264996
bandwidth ratio (required/global)	75	61	79
bandwidth ratio (required/L2 cache)	22	13	29
bandwidth ratio (required/shared)	4	4	8