. 2018 Mar 12;14:1176934318760543. doi: 10.1177/1176934318760543

Table 5.

Performance comparison of various implementations on a “10s” data set.

| Implementation | Runtime (ms) | Speedup |
|---|---|---|
| Java on CPU [14] | 10 800 | |
| C++ Baseline [14] | 1267 | |
| Intel Xeon AVX 1 Core [14] | 138 | 78× |
| Intel Xeon 24 Cores [14] | 15 | 720× |
| Altera OpenCL (Arria 10) [13] | 2.8 | 3857× |
| PE Ring (Arria 10) [14] | 2.6 | 4154× |
| NVIDIA Tesla K40 GPU [14] | 70 | 154× |
| Naive intratask | 14.2 | 761× |
| Tile = 2 intratask | 15.5 | 696× |
| Warp based | 20.6 | 524× |
| Improved warp based | 12.8 | 843× |
| Naive intertask | 76.6 | 141× |
| Tile = 2 intertask | 45.1 | 239× |
| Tile = 4 intertask | 29.6 | 365× |
| Tile = 6 intertask | 26.4 | 409× |
| Tile = 8 intertask | 24.9 | 433.7× |
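
The Speedup column appears to be the ratio of the 10 800 ms Java CPU runtime to each implementation's runtime. This is inferred from the numbers (for example, 10 800 / 15 = 720) and is not stated in the table itself. A minimal Python sketch under that assumption follows; the names `JAVA_BASELINE_MS` and `speedup` are illustrative and do not come from the source.

```python
# Assumption: speedups are measured against the Java CPU runtime (10 800 ms).
JAVA_BASELINE_MS = 10_800.0


def speedup(runtime_ms: float, baseline_ms: float = JAVA_BASELINE_MS) -> float:
    """Return how many times faster a run is than the baseline."""
    if runtime_ms <= 0:
        raise ValueError("runtime_ms must be positive")
    return baseline_ms / runtime_ms


# A few runtimes (in ms) copied from the table above.
runtimes_ms = {
    "Intel Xeon AVX 1 Core": 138.0,
    "Intel Xeon 24 Cores": 15.0,
    "PE Ring (Arria 10)": 2.6,
    "NVIDIA Tesla K40 GPU": 70.0,
    "Improved warp based": 12.8,
}

speedups = {name: speedup(ms) for name, ms in runtimes_ms.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```

Some table entries look truncated rather than rounded (10 800 / 15.5 is about 696.8, listed as 696×), so a recomputed ratio can differ from the listed value by up to one.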