Table 11.
Comparison of inference speedup for our compression method on GPT-2. Evaluation is conducted by generating one sentence at a time autoregressively.
| Method | Compression percentage | Number of parameters | GPU latency | GPU speedup | CPU latency | CPU speedup |
|---|---|---|---|---|---|---|
| GPT-2 | 0% | 345 M | 553 ms | 1.00× | 1,683 ms | 1.00× |
| Our method | 10% | 280 M | 440 ms | 1.25× | 1,255 ms | 1.34× |
| Our method | 20% | 19 M | 307 ms | 1.80× | 868 ms | 1.93× |
| Our method | 30% | 80 M | 234 ms | 2.36× | 582 ms | 2.89× |
| Our method | 40% | 72 M | 167 ms | 3.31× | 457 ms | 3.68× |
| Our method | 50% | 62 M | 132 ms | 4.18× | 367 ms | 4.58× |
| Our method | 65% | 60 M | 90 ms | 6.14× | 200 ms | 8.41× |
Significant values are in bold.
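
The speedup figures in Table 11 can be reproduced from the latency figures as the ratio of the uncompressed GPT-2 latency to the compressed-model latency. The sketch below is illustrative only: the names `BASELINE_MS`, `COMPRESSED_MS`, `speedup`, and `speedup_table` are hypothetical, and truncating (rather than rounding) to two decimals is an assumption, chosen because it is consistent with the reported values.

```python
# Latencies in milliseconds, copied from Table 11.
BASELINE_MS = {"gpu": 553, "cpu": 1683}

# Keyed by compression percentage.
COMPRESSED_MS = {
    10: {"gpu": 440, "cpu": 1255},
    20: {"gpu": 307, "cpu": 868},
    30: {"gpu": 234, "cpu": 582},
    40: {"gpu": 167, "cpu": 457},
    50: {"gpu": 132, "cpu": 367},
    65: {"gpu": 90, "cpu": 200},
}


def speedup(baseline_ms: int, latency_ms: int) -> float:
    """Return baseline_ms / latency_ms, truncated to two decimals.

    Truncation is an assumption; integer floor division avoids
    floating-point edge cases when dropping the extra digits.
    """
    return (baseline_ms * 100 // latency_ms) / 100


def speedup_table() -> dict[int, dict[str, float]]:
    """Compute GPU and CPU speedups for every compression level."""
    return {
        pct: {dev: speedup(BASELINE_MS[dev], ms) for dev, ms in row.items()}
        for pct, row in COMPRESSED_MS.items()
    }


if __name__ == "__main__":
    for pct, row in speedup_table().items():
        print(f"{pct}%: GPU {row['gpu']:.2f}x, CPU {row['cpu']:.2f}x")
```

For example, at 65% compression the GPU speedup is 553 / 90 ≈ 6.14 and the CPU speedup is 1,683 / 200 ≈ 8.41.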