[Preprint]. 2024 Jul 12:arXiv:2407.09100v1. [Version 1]

Figure 3: Architectures of winning solutions.

Across all subplots: C: number of output channels in convolution layers, Cin: number of input channels, K: size of convolution kernels, S: stride, G: number of groups for convolution channels, B: batch size. Core: green, readout: blue. A: DwiseNeuro. The core is based on 3D factorised convolutions. It is the only solution whose readout is not based on the Gaussian readout (Lurz et al., 2021). B: Dynamic-V1FM. The core is transformer-based; the Gaussian readout is extended to sample the core output at several resolutions and then fuse them. Here w denotes the readout linear weights learnt for each neuron. C: ViV1T. The core is replaced with a spatiotemporal transformer. D: Ensembled factorised baseline.
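To make the caption's symbols concrete, the following minimal sketch counts the weights of a full 3D convolution and of a factorised one, assuming the common factorisation into a spatial (1, K, K) convolution followed by a temporal (K, 1, 1) convolution. The function names, the intermediate channel count `c_mid`, and the example sizes are hypothetical; the caption does not specify the exact factorisation or layer sizes used by any of the solutions.

```python
def conv3d_params(c_in, c_out, kernel, groups=1):
    """Weight count (no bias) of a 3D convolution with kernel (Kt, Kh, Kw).

    c_in is Cin, c_out is C and groups is G in the caption's notation.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    kt, kh, kw = kernel
    # Each output channel only sees c_in / groups input channels.
    return c_out * (c_in // groups) * kt * kh * kw


def factorised_conv3d_params(c_in, c_out, kernel, groups=1, c_mid=None):
    """Weight count of a spatial (1, Kh, Kw) conv followed by a temporal (Kt, 1, 1) conv.

    c_mid (assumed, defaults to c_out) is the channel count between the two convs.
    """
    kt, kh, kw = kernel
    c_mid = c_out if c_mid is None else c_mid
    spatial = conv3d_params(c_in, c_mid, (1, kh, kw), groups)
    temporal = conv3d_params(c_mid, c_out, (kt, 1, 1), groups)
    return spatial + temporal


# Illustrative sizes only: Cin = C = 64, temporal kernel 5, spatial kernel 3x3.
full = conv3d_params(64, 64, (5, 3, 3))
factorised = factorised_conv3d_params(64, 64, (5, 3, 3))
depthwise = conv3d_params(64, 64, (5, 3, 3), groups=64)  # G = C
```

Under these assumed sizes, the factorised form needs fewer weights than the full kernel because the spatial and temporal kernel sizes add rather than multiply, and grouping (larger G) reduces the count further.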