[Preprint]. 2024 Jul 12:arXiv:2407.09100v1. [Version 1]

Table 3:

Search space and final settings of the ViV1T core hyperparameters. We performed Hyperband Bayesian optimization (Li et al., 2017) with 20 iterations to find the setting that yields the best single-trial correlation on the validation set. The resulting ViV1T model contains 12M trainable parameters, about 13% more than the factorized baseline.

hyperparameter	search space	final value
Core
embedding dim.	uniform, min: 8, max: 512, step: 8	112
learning rate	uniform, min: 0.0001, max: 0.01	0.0048
patch dropout	uniform, min: 0, max: 0.5	0.1338
drop path	uniform, min: 0, max: 0.5	0.0505
pos. encoding	none, learnable, sinusoidal	learnable
weight decay	uniform, min: 0, max: 1	0.1789
batch size	uniform, min: 1, max: 64	6
Spatial Transformer
num. blocks	uniform, min: 1, max: 8, step: 1	3
patch size	uniform, min: 3, max: 16, step: 1	7
patch stride	uniform, min: 1, max: patch size, step: 1	2
Temporal Transformer
num. blocks	uniform, min: 1, max: 8, step: 1	5
patch size	uniform, min: 1, max: 50, step: 1	25
patch stride	uniform, min: 1, max: patch size, step: 1	1
Multi-head attention (MHA) layer
num. heads	uniform, min: 1, max: 16, step: 1	11
head dim.	uniform, min: 8, max: 512, step: 8	48
MHA dropout	uniform, min: 0, max: 0.5	0.3580
Feedforward (FF) layer
FF dim.	uniform, min: 8, max: 512, step: 8	136
FF activation	Tanh, Sigmoid, ELU, GELU, SwiGLU	GELU
FF dropout	uniform, min: 0, max: 0.5	0.0592
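
For readers who want to work with the search space in Table 3 programmatically, it can be written down as a small Python structure. The following is a minimal sketch, not the authors' code: the key names (`core/embedding_dim`, `mha/num_heads`, and so on) and the helper functions are hypothetical, the batch size is assumed to take integer values (the table gives no step), and the sampler draws plain uniform random values rather than running Hyperband or Bayesian optimization. It only illustrates the ranges, the step grids, and the constraint that the patch stride cannot exceed the patch size.

```python
import random

# Search space transcribed from Table 3. Key names are hypothetical.
# ("uniform", min, max, step): step=None means a continuous range.
# ("choice", [options]): categorical.
SEARCH_SPACE = {
    "core/embedding_dim": ("uniform", 8, 512, 8),
    "core/learning_rate": ("uniform", 0.0001, 0.01, None),
    "core/patch_dropout": ("uniform", 0.0, 0.5, None),
    "core/drop_path": ("uniform", 0.0, 0.5, None),
    "core/pos_encoding": ("choice", ["none", "learnable", "sinusoidal"]),
    "core/weight_decay": ("uniform", 0.0, 1.0, None),
    "core/batch_size": ("uniform", 1, 64, 1),  # assumption: integer steps
    "spatial/num_blocks": ("uniform", 1, 8, 1),
    "spatial/patch_size": ("uniform", 3, 16, 1),
    "temporal/num_blocks": ("uniform", 1, 8, 1),
    "temporal/patch_size": ("uniform", 1, 50, 1),
    "mha/num_heads": ("uniform", 1, 16, 1),
    "mha/head_dim": ("uniform", 8, 512, 8),
    "mha/dropout": ("uniform", 0.0, 0.5, None),
    "ff/dim": ("uniform", 8, 512, 8),
    "ff/activation": ("choice", ["Tanh", "Sigmoid", "ELU", "GELU", "SwiGLU"]),
    "ff/dropout": ("uniform", 0.0, 0.5, None),
}

# Final values reported in the last column of Table 3.
FINAL_VALUES = {
    "core/embedding_dim": 112, "core/learning_rate": 0.0048,
    "core/patch_dropout": 0.1338, "core/drop_path": 0.0505,
    "core/pos_encoding": "learnable", "core/weight_decay": 0.1789,
    "core/batch_size": 6,
    "spatial/num_blocks": 3, "spatial/patch_size": 7, "spatial/patch_stride": 2,
    "temporal/num_blocks": 5, "temporal/patch_size": 25, "temporal/patch_stride": 1,
    "mha/num_heads": 11, "mha/head_dim": 48, "mha/dropout": 0.3580,
    "ff/dim": 136, "ff/activation": "GELU", "ff/dropout": 0.0592,
}


def sample_value(spec, rng):
    """Draw one value from a single search-space entry."""
    if spec[0] == "choice":
        return rng.choice(spec[1])
    _, low, high, step = spec
    if step is None:
        return rng.uniform(low, high)
    # Discrete grid: low, low + step, ..., high.
    return low + step * rng.randint(0, (high - low) // step)


def sample_config(seed=None):
    """Draw one random configuration (plain random search, for illustration)."""
    rng = random.Random(seed)
    config = {name: sample_value(spec, rng) for name, spec in SEARCH_SPACE.items()}
    # The stride range depends on the sampled patch size: 1 <= stride <= patch size.
    for block in ("spatial", "temporal"):
        config[f"{block}/patch_stride"] = rng.randint(1, config[f"{block}/patch_size"])
    return config


def is_valid(config):
    """Check that every value lies inside the Table 3 search space."""
    for name, spec in SEARCH_SPACE.items():
        value = config[name]
        if spec[0] == "choice":
            if value not in spec[1]:
                return False
            continue
        _, low, high, step = spec
        if not low <= value <= high:
            return False
        if step is not None and (value - low) % step != 0:
            return False
    return all(
        1 <= config[f"{b}/patch_stride"] <= config[f"{b}/patch_size"]
        for b in ("spatial", "temporal")
    )
```

Calling `is_valid(FINAL_VALUES)` confirms that the reported final settings fall inside the stated ranges and step grids, and `sample_config(seed)` gives a reproducible random draw that could seed a search of one's own.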