Table 1:
self-attention (input sequence length n) | n = 512: memory (MB) | n = 512: time (ms) | n = 2048: memory (MB) | n = 2048: time (ms) | n = 8192: memory (MB) | n = 8192: time (ms) |
---|---|---|---|---|---|---|
Transformer | 54 (1×) | 0.8 (1×) | 685 (1×) | 10.0 (1×) | 10233 (1×) | 155.4 (1×) |
Linformer-256 | 41 (1.3×) | 0.7 (1.1×) | 165 (4.2×) | 2.7 (3.6×) | 635 (16.1×) | 11.3 (13.8×) |
Longformer-257 | 32.2 (1.7×) | 2.4 (0.3×) | 130 (5.3×) | 9.2 (1.0×) | 455 (22.5×) | 36.2 (4.3×) |
Nyströmformer-64 | 35 (1.5×) | 0.7 (1.1×) | 118 (5.8×) | 2.7 (3.6×) | 450 (22.8×) | 12.3 (12.7×) |
Nyströmformer-32 | 26 (2.1×) | 0.6 (1.2×) | 96 (7.1×) | 2.6 (3.7×) | 383 (26.7×) | 11.5 (13.4×) |
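
The parenthesized factors in Table 1 appear to be the standard Transformer's cost divided by each method's cost, so values above 1× mean the method is cheaper. As a minimal sketch under that assumption, the n = 8192 factors can be recomputed from the printed values; the names `baseline`, `methods`, and `improvement` are illustrative only and do not come from the source. Because the printed measurements are rounded, some recomputed factors differ from the table by about 0.1.

```python
# Values copied from the n = 8192 columns of Table 1 (memory in MB, time in ms).
baseline = {"memory": 10233.0, "time": 155.4}  # standard Transformer row
methods = {
    "Linformer-256": {"memory": 635.0, "time": 11.3},
    "Longformer-257": {"memory": 455.0, "time": 36.2},
    "Nyströmformer-64": {"memory": 450.0, "time": 12.3},
    "Nyströmformer-32": {"memory": 383.0, "time": 11.5},
}


def improvement(baseline_value: float, method_value: float) -> float:
    """Return baseline / method (assumed definition of the table's factors)."""
    return baseline_value / method_value


# One-decimal factors per method, mirroring the table's presentation.
ratios = {
    name: {key: round(improvement(baseline[key], cost[key]), 1) for key in baseline}
    for name, cost in methods.items()
}

for name, factor in ratios.items():
    print(f"{name}: memory {factor['memory']}x, time {factor['time']}x")
```

The same calculation applies to the n = 512 and n = 2048 columns by substituting the corresponding baseline and method values.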