Insects. 2025 Nov 22;16(12):1187. doi: 10.3390/insects16121187
Algorithm 1: Explainable Analysis Module (EAM) Forward and Training Pseudocode

Input: Visual feature map $A \in \mathbb{R}^{8 \times 8 \times 512}$, raw audio $x$, class label $c$, cross-modal attention marginal $M_c$, class gating mask $\Pi_c$

Output: Cross-modal heatmap $H_c$, explanation loss $L_{\mathrm{xai}}$

Visual explanation branch:

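$A_v \leftarrow \mathrm{Conv}_{1 \times 1}(A;\ 512 \to 128)$ // Channel reduction

$H_v \leftarrow \mathrm{Deconv}_{k=4,\,s=2}^{\times 3}(A_v;\ 128 \to 64 \to 32 \to 1)$ // Upsampling $8 \times 8 \to 64 \times 64$

$H_v \leftarrow \mathrm{GradCAM{+}{+}}(A, c)$ // Generate visual saliency map

$H_v \leftarrow \mathrm{ResizeNorm}(H_v,\ 64 \times 64)$ // Align spatial size and normalize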

Acoustic explanation branch:

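$S \leftarrow \mathrm{MelSpec}(x)$ // Convert audio to Mel-spectrogram

$T \leftarrow \mathrm{Tok2Grid}(S)$ // Reshape to $8 \times 4 \times 512$ grid

$A_a \leftarrow \mathrm{Conv}_{1 \times 1}(T;\ 512 \to 128)$ // Channel reduction

$H_a \leftarrow \mathrm{Deconv}_{k=4,\,s=2}^{\times 3}(A_a;\ 128 \to 64 \to 32 \to 1)$ // Upsampling $8 \times 4 \to 64 \times 64$

$H_a \leftarrow \mathrm{IntegratedGradients}(S, c)$ // Generate time–frequency saliency map

$H_a \leftarrow \mathrm{ResizeNorm}(H_a,\ 64 \times 64)$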

Gated fusion:

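$U \leftarrow [H_v, H_a]$ // Concatenate along channel dimension ($64 \times 64 \times 2$)

$G \leftarrow \sigma\big(\mathrm{Conv}_{1 \times 1}(\mathrm{GELU}(\mathrm{Conv}_{1 \times 1}(U;\ 2 \to 16));\ 16 \to 1)\big)$ // Pixel-wise gating mask $G \in [0,1]$

$H_c \leftarrow G \odot H_v + (1 - G) \odot H_a$ // Cross-modal fused explanation map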

Training-phase losses:

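$\tilde{H}_c \leftarrow \mathrm{Softmax}(\mathrm{vec}(H_c))$, $\tilde{M}_c \leftarrow \mathrm{Softmax}(\mathrm{vec}(M_c))$ // Distribution normalization

$L_{\mathrm{align}} \leftarrow D_{\mathrm{KL}}(\tilde{H}_c \,\|\, \tilde{M}_c)$ // Attention–explanation alignment loss

$L_{\mathrm{tv}} \leftarrow \|\nabla H_c\|_1$ // Spatial smoothness regularization

$F \leftarrow \mathrm{BackboneFeat}(\cdot)$, $R_{\mathrm{gate}} \leftarrow \|F - \Pi_c \odot F\|_2^2$ // Class-wise gating regularization

$L_{\mathrm{xai}} \leftarrow \mu_1 L_{\mathrm{tv}} + \mu_2 L_{\mathrm{align}} + \mu_3 R_{\mathrm{gate}}$ // Total explanation loss

return $H_c$, $L_{\mathrm{xai}}$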
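
The gated fusion and training-phase losses of Algorithm 1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that the algorithm does not state: the 1×1 convolutions are written as per-pixel linear maps with hypothetical weights `w1`, `b1`, `w2`, `b2`; GELU uses the common tanh approximation; the attention marginal $M_c$ is taken to have the same 64×64 shape as $H_c$; the smoothness term is discretized as an anisotropic total variation (L1 norm of forward differences); and all function names are illustrative.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a flattened map.
    e = np.exp(v - v.max())
    return e / e.sum()

def gelu(x):
    # Tanh approximation of GELU (assumption; the exact variant is not specified).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_v, h_a, w1, b1, w2, b2):
    # U: stack the two saliency maps along a channel axis -> (64, 64, 2).
    u = np.stack([h_v, h_a], axis=-1)
    # A 1x1 convolution is a per-pixel linear map: 2 -> 16 channels, then GELU.
    hidden = gelu(u @ w1 + b1)
    # Second 1x1 convolution (16 -> 1) followed by a sigmoid gives the gate G.
    g = sigmoid(hidden @ w2 + b2)[..., 0]
    # Pixel-wise convex combination of the two maps.
    h_c = g * h_v + (1.0 - g) * h_a
    return h_c, g

def explanation_loss(h_c, m_c, feat, pi_c, mu=(1.0, 1.0, 1.0), eps=1e-12):
    # Normalize both maps into distributions, then KL(H~ || M~).
    p = softmax(h_c.ravel())
    q = softmax(m_c.ravel())
    l_align = np.sum(p * (np.log(p + eps) - np.log(q + eps)))
    # Anisotropic total variation: L1 norm of vertical and horizontal differences.
    l_tv = np.abs(np.diff(h_c, axis=0)).sum() + np.abs(np.diff(h_c, axis=1)).sum()
    # Squared L2 energy of the features removed by the class gating mask.
    r_gate = np.sum((feat - pi_c * feat) ** 2)
    return mu[0] * l_tv + mu[1] * l_align + mu[2] * r_gate

# Toy usage with random inputs (all values are placeholders, not data from the paper).
rng = np.random.default_rng(0)
h_v, h_a = rng.random((64, 64)), rng.random((64, 64))
w1, b1 = 0.1 * rng.normal(size=(2, 16)), np.zeros(16)
w2, b2 = 0.1 * rng.normal(size=(16, 1)), np.zeros(1)
h_c, g = gated_fusion(h_v, h_a, w1, b1, w2, b2)

m_c = rng.random((64, 64))                      # stand-in attention marginal
feat = rng.normal(size=512)                     # stand-in backbone feature
pi_c = (rng.random(512) > 0.5).astype(float)    # stand-in binary class mask
loss = explanation_loss(h_c, m_c, feat, pi_c)
```

Because $G$ lies in $[0,1]$, every pixel of the fused map stays between the corresponding visual and acoustic saliency values. Each of the three loss terms is non-negative, and the total is zero only when the fused map is constant, its distribution matches the attention marginal, and the class mask leaves the features unchanged.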