Fig. 2. Confusion matrix for action recognition on predicted spans in the BABEL-TAL-20 (BT-20) dataset.
Each cell contains the number of samples that are misclassified in action classification, and the color of each diagonal cell represents the precision of the corresponding class. Notably, the action “stand” is frequently mistaken for several other actions, highlighting the challenging and diverse nature of our dataset, as well as the precision of our annotations. The actions “throw” and “catch” are also often misclassified by the model. We attribute this to their frequent occurrence in sequence, which likely leads to localization errors that misalign the predicted and ground-truth action spans. Furthermore, despite its lower precision, the action “grasp” is frequently predicted and commonly confused with “lift”, possibly because the two actions share similar motion patterns, features, and temporal context.
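For readers who want to reproduce this kind of analysis, the following is a minimal sketch of how a confusion matrix and per-class precision can be computed from matched label pairs. It is not the code used to produce the figure. It assumes rows index ground-truth classes and columns index predicted classes (the figure's orientation is not stated here). The function names `confusion_counts` and `per_class_precision` are hypothetical, and the toy labels are invented for illustration only.

```python
def confusion_counts(y_true, y_pred, labels):
    """Return matrix[i][j] = number of samples with true label i predicted as j.

    Assumed convention: rows are ground truth, columns are predictions.
    """
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_precision(matrix):
    """Precision of class j = diagonal count / column sum (all predictions of j)."""
    n = len(matrix)
    precisions = []
    for j in range(n):
        predicted_as_j = sum(matrix[i][j] for i in range(n))
        # Guard against classes that are never predicted.
        precisions.append(matrix[j][j] / predicted_as_j if predicted_as_j else 0.0)
    return precisions


# Toy labels, invented for illustration only (not data from the paper).
labels = ["stand", "grasp", "lift"]
y_true = ["stand", "stand", "grasp", "lift", "lift", "grasp"]
y_pred = ["stand", "grasp", "grasp", "grasp", "lift", "grasp"]

matrix = confusion_counts(y_true, y_pred, labels)
precision = per_class_precision(matrix)
```

In this toy example, “grasp” is predicted four times but is correct only twice, so it is both frequently predicted and low in precision, which is the pattern described above for “grasp” and “lift”.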