(a) Decomposition of geometry into columns, each associated with a thread block. Each column is ‘padded’, with the top tile being initially vacant. (b) For a given thread block, the post-collision distribution on a tile (orange) is written into shared memory (peach) with memory locations determined by streaming. Moments are recomputed for the previous tile (purple), but are shifted one tile below when writing back to global memory (red). (c) This shift is based on tiled circular array shifting, updated from bottom to top, with the moments of each updated tile written to the location beneath it.