Table 5.
Implementation | Runtime (ms) | Speedup |
---|---|---|
Java on CPU [14] | 10 800 | 1× |
C++ Baseline [14] | 1267 | 9× |
Intel Xeon AVX 1 Core [14] | 138 | 78× |
Intel Xeon 24 Cores [14] | 15 | 720× |
Altera OpenCL (Arria 10) [13] | 2.8 | 3857× |
PE Ring (Arria 10) [14] | 2.6 | 4154× |
NVIDIA Tesla K40 GPU [14] | 70 | 154× |
Naive intratask | 14.2 | 761× |
Tile = 2 intratask | 15.5 | 696× |
Warp based | 20.6 | 524× |
Improved warp based | 12.8 | 843× |
Naive intertask | 76.6 | 141× |
Tile = 2 intertask | 45.1 | 239× |
Tile = 4 intertask | 29.6 | 365× |
Tile = 6 intertask | 26.4 | 409× |
Tile = 8 intertask | 24.9 | 433.7× |
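
The speedup column appears to be the Java-on-CPU runtime divided by each implementation's runtime. The short Python sketch below recomputes a few entries under that assumption; the `speedup` helper and the dictionary layout are illustrative names, not part of the original work. Small last-digit differences from the table (for example 843.75 against the listed 843×) are presumably due to rounding.

```python
# Runtimes in ms copied from Table 5; the Java CPU run is assumed to be the baseline.
BASELINE_MS = 10_800.0

runtimes_ms = {
    "C++ Baseline": 1267.0,
    "Intel Xeon 24 Cores": 15.0,
    "PE Ring (Arria 10)": 2.6,
    "Improved warp based": 12.8,
    "Tile = 8 intertask": 24.9,
}


def speedup(runtime_ms, baseline_ms=BASELINE_MS):
    """Return how many times faster a run is than the baseline (hypothetical helper)."""
    if runtime_ms <= 0:
        raise ValueError("runtime must be positive")
    return baseline_ms / runtime_ms


# Recompute the speedup for each selected row and print it with one decimal.
speedups = {name: speedup(ms) for name, ms in runtimes_ms.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```

Keeping the baseline as an explicit parameter makes it easy to re-express the same runtimes against a different reference, such as the C++ baseline.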