2022 Jan 18;55(6):4755–4808. doi: 10.1007/s10462-021-10116-x

Table 4.

Vision-based HAR models

	C1	C2	C3	C4	C5	C6	C7	C8
	# Features	Feature extraction	ML/DL model	Architecture	Metrics	Validation	Hyper-parameters/optimizer/loss function	CIT*
R1	Time domain	Hand-crafted	3DCNN on Color-skl-MHI and RJI	Input layer with skeletal joints, Color-skl-MHI followed by 3D-DCNN, RJI followed by 3D-DCNN, decision fusion, output	Accuracy	Cross validation	DO ratios for the three hidden layers (0.1%, 0.2%, 0.3%)/SGD	Phyo et al. (2019)
R2	Spatio-temporal	Automatic	VGG-16, VGG-19, Inception v3	A 224 × 224 image is input; features extracted from the fc1 layer give a 4096-dimensional vector per image	Accuracy, precision, recall, F1-score	10% of the data is used for validation	All three CNNs pre-trained on ImageNet, then trained on Weizmann reusing the same weights	Deep and Zheng (2019)
R3	Spatio-temporal	Automatic	ResNet-50	C1, MP, C2-C5, AP, FC (2048), SM	Accuracy	Evaluated on UCF-101 and HMDB-51	BS: 128, DO, LR: $10^{-4}$ and $10^{-5}$/SGD	Feichtenhofer et al. (2017)
R4	Spatio-temporal	Automatic	ResNet-50	Raw clip input, C1, P, C2, C3, C4, C5, GAP, FC (no. of classes). Pre-trained on Kinetics-400, Kinetics-600 and Kinetics-700	mAPs, GFLOPs	Model performance evaluated on the AVA dataset	LR, WD: $10^{-7}$, batch normalization/SGD	Feichtenhofer and Ai (2019)
R5	Spatio-temporal	Automatic	MERS model with ResNeXt-101	MERS: train using flow, freeze weights, train with RGB using MSE loss. MARS: train using privileged flow network, freeze weights, use RGB frames during test phase	Top-1 mean accuracy	Kinetics 400: 20 k, MiniKinetics: 5 k	WD = 0.0005, LR = 0.1, momentum = 0.9 and LR = 0.1 for 64f-clips/SGD/Cross entropy	Crasto et al. (2019)
R6	Spatio-temporal	Automatic	HATnet based on ResNet-50 and STCnet	2D ConvNets extract spatial structure; 3D Conv handles interaction between frames. Both 2D and 3D use ResNet-50	Top-1 mAPs	Kinetics 400 and 600	Fine-tuned on UCF-101 and HMDB-51/Cross entropy	Diba et al. (2020)
R7	Spatio-temporal	Automatic	2D ResNet-50 with STM blocks	Video frames input, C1, C2x, C3x, C4x, C5x, FC, output. All residual blocks are replaced with STM blocks (1 × 1 2D conv, followed by CMM and CSTM blocks, then 1 × 1 2D conv)	Top-1, top-5 accuracy	Kinetics 400: 19,095	LR = 0.01, LR = 0.001 for 25 epochs, momentum = 0.9, WD = $2.5 \times 10^{-4}$/SGD	Jiang et al. (2019)

T time, F frequency, CV cross validation, LOSO leave one subject out, C convolution, P pooling, AP average pooling, MP max pooling, GAP global average pooling, FC fully connected, SM softmax, BN batch normalization layer, LR learning rate, WD weight decay, DO dropout, BS batch size, SGD stochastic gradient descent, mAPs mean average precision, GFLOPs giga floating-point operations, Spec specificity, Sens sensitivity, AUC area under curve, EER equal error rate, TL transfer learning

*CIT citations
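
Column C7 lists LR, WD and momentum values for SGD without showing how the three interact. The sketch below illustrates one common convention: weight decay is added to the gradient as an L2 penalty, and the momentum buffer is accumulated before the learning rate is applied. The cited papers may use other variants. The function name `sgd_momentum_step` and the toy quadratic objective are illustrative assumptions, not taken from any of the cited works; the default values are those reported in row R5 (LR = 0.1, momentum = 0.9, WD = 0.0005).

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.1, momentum=0.9, weight_decay=0.0005):
    """Return updated (weights, velocity) after one SGD step.

    Assumed convention (hypothetical helper, not from the cited papers):
      g <- g + weight_decay * w     (L2-style weight decay)
      v <- momentum * v + g         (momentum buffer)
      w <- w - lr * v               (parameter update)
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w
        v = momentum * v + g
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


# Toy objective (assumption for illustration): f(w) = 0.5 * sum(w_i ** 2),
# whose gradient with respect to each w_i is simply w_i.
weights = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(200):
    grads = list(weights)
    weights, velocity = sgd_momentum_step(weights, grads, velocity)
```

With these settings the toy weights shrink toward zero. In practice LR is usually lowered on a schedule, as in row R7, where it drops from 0.01 to 0.001.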