| AVER | Audio-Video Emotion Recognition |
| CNN | Convolutional Neural Network |
| CREMA-D [33] | Crowd-sourced Emotional Multimodal Actors Dataset |
| DNN | Deep Neural Network |
| FC | Fully Connected layer |
| HCI | Human–Computer Interaction |
| IEMOCAP [30] | Interactive Emotional Dyadic Motion Capture dataset |
| LSTM | Long Short-Term Memory |
| MRPN | Multi-modal Residual Perceptron Network |
| RAVDESS [34] | The Ryerson Audio–Visual Database of Emotional Speech and Song |
| RP | Residual Perceptron |
| SAC | Sequence Aggregation Component |
| SOTA | State-of-the-Art solution |
| STFT | Short-Time Fourier Transform |
| SVM | Support Vector Machine |
| ViT [38] | Vision Transformer |