2022 Jan 18;55(6):4755–4808. doi: 10.1007/s10462-021-10116-x

Table 4.

Vision-based HAR models

C1 C2 C3 C4 C5 C6 C7 C8
| # | Features | Feature extraction | ML/DL model | Architecture | Metrics | Validation | Hyper-parameters/optimizer/loss function | CIT* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1 | Time domain | Hand-crafted | 3DCNN on Color-skl-MHI and RJI | Input layer with skeletal joints; Color-skl-MHI followed by 3D-DCNN; RJI followed by 3D-DCNN; decision fusion; output | Accuracy | Cross validation | DO ratios for the three hidden layers (0.1%, 0.2%, 0.3%)/SGD | Phyo et al. (2019) |
| R2 | Spatio-temporal | Automatic | VGG-16, VGG-19, Inception v3 | A 224 × 224 image is the input; features extracted from the fc1 layer give a 4096-dimensional vector per image | Accuracy, precision, recall, F1-score | 10% of the data is used for validation | All 3 CNNs trained on ImageNet, then trained on Weizmann reusing the same weights | Deep and Zheng (2019) |
| R3 | Spatio-temporal | Automatic | ResNet-50 | C1, MP, C2-C5, AP, FC (2048), SM | Accuracy | Evaluated on UCF-101 and HMDB-51 | BS: 128, DO, LR: 10^-4 and 10^-5/SGD | Feichtenhofer et al. (2017) |
| R4 | Spatio-temporal | Automatic | ResNet-50 | Raw clip input, C1, P, C2, C3, C4, C5, GAP, FC (number of classes). Pre-trained on Kinetics-400, Kinetics-600 and Kinetics-700 | mAPs, GFLOPs | Model performance evaluated on the AVA dataset | LR, WD: 10^-7, batch normalization/SGD | Feichtenhofer and Ai (2019) |
| R5 | Spatio-temporal | Automatic | MERS model with ResNeXt-101 | MERS: train using flow, freeze weights, train with RGB using MSE loss. MARS: train using privileged flow network, freeze weights, use RGB frames during the test phase | Top-1 mean accuracy | Kinetics 400: 20 k, MiniKinetics: 5 k | WD = 0.0005, LR = 0.1, momentum = 0.9 and LR = 0.1 for 64f-clips/SGD/cross entropy | Crasto et al. (2019) |
| R6 | Spatio-temporal | Automatic | HATnet based on ResNet-50 and STCnet | 2D ConvNets: extract spatial structure; 3D Conv: handles interactions across frames. Both 2D and 3D use ResNet-50 | Top-1 mAPs | Kinetics 400 and 600 | Fine-tuned on UCF-101 and HMDB-51/cross entropy | Diba et al. (2020) |
| R7 | Spatio-temporal | Automatic | 2D ResNet-50 with STM blocks | Video frames input, C1, C2x, C3x, C4x, C5x, FC, output. All residual blocks are replaced with STM blocks (1 × 1 2D conv, followed by CMM and CSTM blocks, then 1 × 1 2D conv) | Top-1, top-5 accuracy | Kinetics 400: 19,095 | LR = 0.01, LR = 0.001 for 25 epochs, momentum = 0.9, WD = 2.5e-4/SGD | Jiang et al. (2019) |

T: time; F: frequency; CV: cross validation; LOSO: leave one subject out; C: convolution; P: pooling; AP: average pooling; MP: max pooling; FC: fully connected; SM: softmax; BN: batch normalization layer; LR: learning rate; DO: dropout; BS: batch size; SGD: stochastic gradient descent; mAPs: mean average precision; GFLOP: giga floating point operations per second; Spec: specificity; Sens: sensitivity; AUC: area under curve; EER: equal error rate; TL: transfer learning
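
The hyper-parameter column gives SGD settings in compact form, for example LR = 0.01, momentum = 0.9 and WD = 2.5e-4 in row R7. The sketch below shows one common way such values enter a single update step: weight decay is added to the gradient as an L2 penalty, and a momentum buffer accumulates the result. This is an illustration only, not the training code of any cited paper; the function name `sgd_step`, the list-based parameters and the toy quadratic loss are all assumptions made for the example.

```python
def sgd_step(params, grads, velocity, lr=0.01, momentum=0.9, weight_decay=2.5e-4):
    """One in-place SGD step with momentum and L2 weight decay.

    Hypothetical helper for illustration; defaults mirror row R7 of the table.
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        g = g + weight_decay * p                   # WD: L2 penalty added to the gradient
        velocity[i] = momentum * velocity[i] + g   # momentum buffer
        params[i] = p - lr * velocity[i]           # LR scales the step
    return params, velocity


# Toy problem (assumed): minimise 0.5 * sum(w^2), whose gradient is w itself.
weights = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(200):
    grads = list(weights)
    sgd_step(weights, grads, velocity)
```

After the loop, both entries of `weights` have shrunk close to zero, which is the minimiser of the toy loss. Deep-learning frameworks differ in exactly how they combine momentum and weight decay, so the formulation above should be checked against the framework actually used.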

*CIT: citations
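
The Metrics column refers to top-1 and top-5 accuracy as well as precision, recall and F1-score. As a minimal sketch of how these are usually computed from per-class scores, assuming plain Python lists and made-up toy numbers (the function names `top_k_accuracy` and `precision_recall_f1` are illustrative, not taken from the cited works):

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def precision_recall_f1(predicted, labels, positive):
    """Precision, recall and F1 with one class treated as the positive class."""
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy scores for 4 clips over 3 action classes (invented for illustration).
scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.1, 0.4],
    [0.1, 0.6, 0.3],
    [0.6, 0.3, 0.1],
]
labels = [0, 2, 1, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)

# The top-1 prediction is the index of the largest score in each row.
predicted = [row.index(max(row)) for row in scores]
precision, recall, f1 = precision_recall_f1(predicted, labels, positive=0)
```

Top-5 accuracy is the same computation with `k=5`, which only becomes informative when there are many classes, as in Kinetics 400.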