2023 Sep 18;44(5):967–980. doi: 10.24272/j.issn.2095-8137.2022.449

Figure 2.

Two-stream action recognition model and heatmaps of feature visualization

A: Overview of the TSSA Network architecture. (i) Video segments are randomly sampled as input; (ii) Two video modalities, RGB and optical flow, serve as inputs to the two-stream model; (iii) The two streams use separate networks with the same architecture, each with Res-blocks as the backbone and shift and split attention modules inside the blocks; (iv) The output of each block is used as the input for feature extraction in the next block. Each single-stream network predicts action scores using average fusion, and the class scores are combined for the final prediction. B: Grad-CAM++ heat maps of action recognition. Heat maps were obtained from test videos classified by the trained model. Colors represent weights (ranging from 0 to 1, blue to red) that indicate how important each area is to the prediction result. Red areas in the frames provide the most important discriminative features used by the model in its final predictions.
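The score fusion described in panel A can be illustrated with a minimal sketch. It assumes each stream yields one row of class scores per sampled segment; the function name `fuse_two_stream`, the `flow_weight` parameter, and the final softmax are hypothetical choices made for illustration and are not taken from the paper, which states only that average fusion is used and that class scores are combined.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()


def fuse_two_stream(rgb_segment_scores, flow_segment_scores, flow_weight=1.0):
    """Combine per-segment class scores from two streams (illustrative only).

    Both inputs have shape (num_segments, num_classes).
    flow_weight is an assumed knob; 1.0 weights both streams equally.
    """
    # Average fusion within each stream: mean over the sampled segments.
    rgb = np.asarray(rgb_segment_scores, dtype=float).mean(axis=0)
    flow = np.asarray(flow_segment_scores, dtype=float).mean(axis=0)

    # Combine the class scores of the two streams (assumed weighted sum).
    combined = rgb + flow_weight * flow

    # Convert to probabilities and pick the top class.
    probs = softmax(combined)
    return int(np.argmax(probs)), probs


if __name__ == "__main__":
    rgb_scores = [[2.0, 0.0, 0.0], [1.0, 0.0, 1.0]]   # 2 segments, 3 classes
    flow_scores = [[0.0, 0.0, 3.0], [0.0, 1.0, 3.0]]
    label, probs = fuse_two_stream(rgb_scores, flow_scores)
    print(label, probs)
```

Averaging within each stream before combining keeps the prediction independent of how many segments were sampled, which is one plausible reason to fuse in this order.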