| AAML | Additive Angular Margin Loss |
| AST | Audio Spectrogram Transformer |
| AUC | Area Under the Receiver Operating Characteristic (ROC) Curve |
| BERT | Bidirectional Encoder Representations from Transformers |
| CENS | Chroma Energy Normalized Statistics |
| CNN | Convolutional Neural Network |
| CQT | Constant-Q Transform |
| CRNN | Convolutional Recurrent Neural Network |
| DCNN | Deep Convolutional Neural Network |
| DenseNet | Dense Convolutional Network |
| DL | Deep Learning |
| DNN | Deep Neural Network |
| ESC | Environmental Sound Classification |
| FN | False Negative |
| FP | False Positive |
| GPU | Graphics Processing Unit |
| LSTM | Long Short-Term Memory |
| M2M-AST | Many-to-Many Audio Spectrogram Transformer |
| MFCC | Mel-Frequency Cepstral Coefficients |
| NLP | Natural Language Processing |
| pp | percentage points |
| ResNet | Residual Neural Network |
| RNN | Recurrent Neural Network |
| STFT | Short-Time Fourier Transform |
| TFCNN | Temporal-Frequency attention-based Convolutional Neural Network |
| TP | True Positive |
| VATT | Video–Audio–Text Transformer |