2024 Mar 17;14:6420. doi: 10.1038/s41598-024-56259-z

Table 11.

Comparison of inference speedup for our compression method on GPT-2. Evaluation is conducted by generating one sentence at a time autoregressively.

Method	Compression percentage	Number of parameters	GPU latency	GPU speedup	CPU latency	CPU speedup
GPT-2	0%	345 M	553 ms	1.00×	1,683 ms	1.00×
Our method	10%	280 M	440 ms	1.25×	1,255 ms	1.34×
	20%	19 M	307 ms	1.80×	868 ms	1.93×
	30%	80 M	234 ms	2.36×	582 ms	2.89×
	40%	72 M	167 ms	3.31×	457 ms	3.68×
	50%	62 M	132 ms	4.18×	367 ms	4.58×
	65%	60 M	90 ms	6.14×	200 ms	8.41×

Significant values are in bold.
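
The speedup columns can be cross-checked against the latency columns. The sketch below assumes that speedup is defined as the uncompressed GPT-2 latency divided by the compressed model's latency; the table does not state this definition, and the names `GPU_MS`, `CPU_MS`, and `speedups` are illustrative only. Because the latencies are reported as rounded milliseconds, the recomputed ratios can differ from the printed speedups in the second decimal place.

```python
# Latencies in milliseconds, copied from Table 11.
# Keys are compression percentages; 0 is the uncompressed GPT-2 baseline.
GPU_MS = {0: 553, 10: 440, 20: 307, 30: 234, 40: 167, 50: 132, 65: 90}
CPU_MS = {0: 1683, 10: 1255, 20: 868, 30: 582, 40: 457, 50: 367, 65: 200}


def speedups(latency_ms):
    """Return baseline latency / latency for each compression level.

    Assumption: this ratio is how the table's speedup columns were derived.
    """
    baseline = latency_ms[0]
    return {pct: baseline / ms for pct, ms in latency_ms.items()}


gpu = speedups(GPU_MS)
cpu = speedups(CPU_MS)

for pct in GPU_MS:
    print(f"{pct:>3}%  GPU {gpu[pct]:.2f}x  CPU {cpu[pct]:.2f}x")
```

Under this assumed definition, every recomputed ratio lies within 0.01 of the corresponding value in the table, which suggests the reported speedups follow directly from the reported latencies.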