[Preprint]. 2024 Jul 12:arXiv:2407.09100v1. [Version 1]

Table 3:

Search space and final settings of the ViV1T core hyperparameters. We performed Hyperband Bayesian optimization (Li et al., 2017) with 20 iterations to find the setting that yields the best single-trial correlation on the validation set. The resulting ViV1T model contains 12M trainable parameters, about 13% more than the factorized baseline.

| hyperparameter | search space | final value |
|---|---|---|
| Core | | |
| embedding dim. | uniform, min: 8, max: 512, step: 8 | 112 |
| learning rate | uniform, min: 0.0001, max: 0.01 | 0.0048 |
| patch dropout | uniform, min: 0, max: 0.5 | 0.1338 |
| drop path | uniform, min: 0, max: 0.5 | 0.0505 |
| pos. encoding | none, learnable, sinusoidal | learnable |
| weight decay | uniform, min: 0, max: 1 | 0.1789 |
| batch size | uniform, min: 1, max: 64 | 6 |
| Spatial Transformer | | |
| num. blocks | uniform, min: 1, max: 8, step: 1 | 3 |
| patch size | uniform, min: 3, max: 16, step: 1 | 7 |
| patch stride | uniform, min: 1, max: patch size, step: 1 | 2 |
| Temporal Transformer | | |
| num. blocks | uniform, min: 1, max: 8, step: 1 | 5 |
| patch size | uniform, min: 1, max: 50, step: 1 | 25 |
| patch stride | uniform, min: 1, max: patch size, step: 1 | 1 |
| Multi-head attention (MHA) layer | | |
| num. heads | uniform, min: 1, max: 16, step: 1 | 11 |
| head dim. | uniform, min: 8, max: 512, step: 8 | 48 |
| MHA dropout | uniform, min: 0, max: 0.5 | 0.3580 |
| Feedforward (FF) layer | | |
| FF dim. | uniform, min: 8, max: 512, step: 8 | 136 |
| FF activation | Tanh, Sigmoid, ELU, GELU, SwiGLU | GELU |
| FF dropout | uniform, min: 0, max: 0.5 | 0.0592 |
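
The search space above has two features that are easy to misread: several ranges are quantized by a step, and the upper bound of each patch stride depends on the patch size drawn in the same trial. The following is a minimal sketch of drawing one configuration under those constraints. It uses plain random sampling from Python's standard library, not the Hyperband Bayesian optimization used for the table, and the function names, dictionary keys, and the integer treatment of batch size are assumptions made for illustration rather than the authors' code.

```python
import random


def sample_stepped(rng, low, high, step):
    """Draw uniformly from the grid {low, low + step, ...} capped at high."""
    n_steps = (high - low) // step
    return low + step * rng.randint(0, n_steps)  # randint is inclusive


def sample_patching(rng, size_min, size_max):
    """Draw a patch size, then a stride bounded above by that size."""
    patch_size = sample_stepped(rng, size_min, size_max, 1)
    patch_stride = sample_stepped(rng, 1, patch_size, 1)
    return patch_size, patch_stride


def sample_config(seed=0):
    """Draw one hypothetical trial configuration from the tabulated ranges."""
    rng = random.Random(seed)
    s_size, s_stride = sample_patching(rng, 3, 16)   # spatial tokenizer
    t_size, t_stride = sample_patching(rng, 1, 50)   # temporal tokenizer
    return {
        # Core
        "embedding_dim": sample_stepped(rng, 8, 512, 8),
        "learning_rate": rng.uniform(0.0001, 0.01),
        "patch_dropout": rng.uniform(0.0, 0.5),
        "drop_path": rng.uniform(0.0, 0.5),
        "pos_encoding": rng.choice(["none", "learnable", "sinusoidal"]),
        "weight_decay": rng.uniform(0.0, 1.0),
        # Assumption: batch size is an integer; the table gives no step.
        "batch_size": sample_stepped(rng, 1, 64, 1),
        # Spatial Transformer
        "spatial_num_blocks": sample_stepped(rng, 1, 8, 1),
        "spatial_patch_size": s_size,
        "spatial_patch_stride": s_stride,
        # Temporal Transformer
        "temporal_num_blocks": sample_stepped(rng, 1, 8, 1),
        "temporal_patch_size": t_size,
        "temporal_patch_stride": t_stride,
        # Multi-head attention layer
        "num_heads": sample_stepped(rng, 1, 16, 1),
        "head_dim": sample_stepped(rng, 8, 512, 8),
        "mha_dropout": rng.uniform(0.0, 0.5),
        # Feedforward layer
        "ff_dim": sample_stepped(rng, 8, 512, 8),
        "ff_activation": rng.choice(
            ["Tanh", "Sigmoid", "ELU", "GELU", "SwiGLU"]
        ),
        "ff_dropout": rng.uniform(0.0, 0.5),
    }


if __name__ == "__main__":
    for name, value in sample_config(seed=0).items():
        print(f"{name}: {value}")
```

Sampling the patch size before the stride keeps every draw valid by construction, so no trial has to be rejected for a stride larger than its patch. A real optimizer would replace the random draws with its own proposal mechanism while keeping the same bounds.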