| Algorithm 1: Explainable Analysis Module (EAM) Forward and Training Pseudocode |
| --- |
| Input: visual feature map, raw audio x, class label c, cross-modal attention marginal, class gating mask |
| Output: cross-modal heatmap, explanation loss |
| Visual explanation branch: |
| // Channel reduction |
| // Upsampling |
| // Generate visual saliency map |
| // Align spatial size and normalize |
| Acoustic explanation branch: |
| // Convert audio to Mel-spectrogram |
| // Reshape to grid |
| // Channel reduction |
| // Upsampling |
| // Generate time–frequency saliency map |
| Gated fusion: |
| // Concatenate along channel dimension |
| // Pixel-wise gating mask |
| // Cross-modal fused explanation map |
| Training-phase losses: |
| // Distribution normalization |
| // Attention–explanation alignment loss |
| // Spatial smoothness regularization |
| // Class-wise gating regularization |
| // Total explanation loss |
| return cross-modal heatmap, explanation loss |
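
The gated-fusion and training-loss steps of Algorithm 1 can be illustrated with a minimal sketch. The listing names each step but does not give its formula, so every concrete choice below is an assumption made for illustration: a sigmoid gate with scalar weights standing in for a learned layer over the concatenated maps, a KL divergence for the attention-explanation alignment loss, total variation for the smoothness term, a squared-error penalty for the class-wise gating term, and the weights `lam_tv` and `lam_gate`. All function names are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def min_max_normalize(m):
    """Scale a 2-D map (list of rows) to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, span = min(flat), max(flat) - min(flat)
    if span == 0:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / span for v in row] for row in m]

def gated_fusion(vis, aud, w_v=1.0, w_a=-1.0, bias=0.0):
    """Assumed gate: g = sigmoid(w_v*vis + w_a*aud + bias), fused = g*vis + (1-g)*aud.
    The scalar weights stand in for a learned layer (hypothetical)."""
    gate, fused = [], []
    for row_v, row_a in zip(vis, aud):
        g_row = [1.0 / (1.0 + math.exp(-(w_v * v + w_a * a + bias)))
                 for v, a in zip(row_v, row_a)]
        gate.append(g_row)
        fused.append([g * v + (1.0 - g) * a
                      for g, v, a in zip(g_row, row_v, row_a)])
    return gate, fused

def to_distribution(m, eps=1e-8):
    """Distribution normalization: flatten a non-negative map so it sums to 1."""
    flat = [v + eps for row in m for v in row]
    total = sum(flat)
    return [v / total for v in flat]

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def total_variation(m):
    """Mean absolute difference between horizontal and vertical neighbours."""
    h, w = len(m), len(m[0])
    diffs = [abs(m[i][j + 1] - m[i][j]) for i in range(h) for j in range(w - 1)]
    diffs += [abs(m[i + 1][j] - m[i][j]) for i in range(h - 1) for j in range(w)]
    return sum(diffs) / len(diffs) if diffs else 0.0

def gate_regularizer(gate, class_mask):
    """Assumed class-wise gating term: mean squared gap between gate and mask."""
    sq = [(g - c) ** 2
          for row_g, row_c in zip(gate, class_mask)
          for g, c in zip(row_g, row_c)]
    return sum(sq) / len(sq)

def explanation_loss(attn, fused, gate, class_mask, lam_tv=0.1, lam_gate=0.1):
    """Total loss = alignment + lam_tv * smoothness + lam_gate * gating (assumed weights)."""
    align = kl_divergence(to_distribution(attn), to_distribution(fused))
    return (align
            + lam_tv * total_variation(fused)
            + lam_gate * gate_regularizer(gate, class_mask))

# Toy 2x2 saliency maps standing in for the two explanation branches.
vis = min_max_normalize([[0.0, 2.0], [4.0, 8.0]])
aud = min_max_normalize([[1.0, 1.0], [3.0, 5.0]])
gate, fused = gated_fusion(vis, aud)
loss = explanation_loss(vis, fused, gate, class_mask=[[1.0, 1.0], [1.0, 1.0]])
```

Because the fused map is a per-pixel convex combination of two maps already normalized to [0, 1], it stays in that range without further rescaling. In a real model the gate parameters would be learned and the maps would be tensors, so this sketch only mirrors the order of operations in the listing.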