The kernel is launched over thread-blocks and threads, with the launch configuration expressed in terms of P, the total number of projections. The workflow of a thread-block in each iteration is divided into two stages. In stage A, the pixels of the reference slices are fetched through texture memory, interpolated, and stored in shared memory. This data is then reused extensively in stage B, where groups of threads compute the differences to the corresponding translated image components. Individual threads within a group work on different image components of each reference slice. Collectively, the threads cover a batch of components of each reference slice in each iteration. The result is accumulated in shared memory through atomic reduction operations. All image components are covered over the course of the iterations, where C is the total number of Fourier components. A reduced sum of differences for each pair of orientation and translation is written to global memory before the kernel exits.
DOI: http://dx.doi.org/10.7554/eLife.18722.012
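The two-stage loop described in the caption can be illustrated with a small CPU emulation of a single thread-block. This is only a sketch under stated assumptions, not the kernel itself: the function name `block_diff2`, the `chunk` parameter, and the use of an unweighted squared difference are all hypothetical choices made for illustration, and texture fetching and interpolation are replaced by a plain list slice.

```python
def block_diff2(reference, translated_images, chunk=4):
    """Emulate one thread-block handling one orientation (hypothetical sketch).

    reference: C complex values standing in for the interpolated reference slice.
    translated_images: one list of C complex values per translation.
    chunk: number of components staged per iteration (an assumed parameter).
    Returns one summed squared difference per translation.
    """
    n_comp = len(reference)
    # Stands in for the per-translation accumulators kept in shared memory.
    sums = [0.0] * len(translated_images)

    for start in range(0, n_comp, chunk):  # one pass = one iteration
        # Stage A: stage a batch of reference components in "shared memory".
        # On the GPU this is a texture fetch plus interpolation.
        shared = reference[start:start + chunk]

        # Stage B: every translation reuses the staged batch.
        for t, image in enumerate(translated_images):
            for k, ref_value in enumerate(shared):  # each k would be one thread
                diff = ref_value - image[start + k]
                # Assumed metric: squared magnitude of the difference.
                # On the GPU this accumulation would be an atomic add.
                sums[t] += diff.real ** 2 + diff.imag ** 2

    # Corresponds to the write to global memory before the kernel exits.
    return sums
```

The staged batch is read once per iteration and reused for every translation, which is the data reuse that stage B relies on. The result does not depend on the batch size, so `chunk` only controls how much data is staged at a time.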