Cross-Modal and Contrastive Optimization for Explainable Multimodal Recognition of Predatory and Parasitic Insects

Insects. 2025 Nov 22;16(12):1187. doi: 10.3390/insects16121187

Algorithm 1: Explainable Analysis Module (EAM) Forward and Training Pseudocode

Input: Visual feature map $A \in \mathbb{R}^{8 \times 8 \times 512}$, raw audio $x$, class label $c$, cross-modal attention marginal $M^{c}$, class gating mask $\Pi_{c}$

Output: Cross-modal heatmap $H^{c}$, explanation loss $L_{xai}$

Visual explanation branch:

$A_{v} \leftarrow {Conv}_{1 \times 1} (A; 512 \to 128)$ //Channel reduction

$H_{v}^{\uparrow} \leftarrow {Deconv}_{k = 4, s = 2}^{\times 3} (A_{v}; 128 \to 64 \to 32 \to 1)$ //Upsampling $8 \times 8 \to 64 \times 64$

$H_{v} \leftarrow \text{GradCAM++} (A, c)$ //Generate visual saliency map

$H_{v} \leftarrow ResizeNorm (H_{v}, 64 \times 64)$ //Align spatial size and normalize
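The size bookkeeping in the visual branch can be checked with a small sketch. It assumes each transposed convolution uses padding 1 with no output padding (the pseudocode only gives $k = 4$, $s = 2$), and it assumes ResizeNorm means nearest-neighbour resizing followed by min-max scaling to $[0, 1]$; the helper names `deconv_out_size` and `resize_norm` are hypothetical and are not taken from the paper.

```python
import numpy as np

def deconv_out_size(n, k=4, s=2, p=1):
    """Output length of one transposed convolution (dilation 1, no output padding).

    p=1 is an assumption; the pseudocode does not state the padding.
    """
    return (n - 1) * s - 2 * p + k

def resize_norm(h, size=(64, 64), eps=1e-8):
    """Assumed ResizeNorm: nearest-neighbour resize, then min-max scale to [0, 1]."""
    h = np.asarray(h, dtype=np.float64)
    # Map every target row/column index back to its source index.
    rows = np.arange(size[0]) * h.shape[0] // size[0]
    cols = np.arange(size[1]) * h.shape[1] // size[1]
    out = h[np.ix_(rows, cols)]
    return (out - out.min()) / (out.max() - out.min() + eps)

# Three stacked deconvolutions take a side length of 8 to 16, 32, then 64.
n = 8
for _ in range(3):
    n = deconv_out_size(n)
print(n)  # 64

# Stand-in for a coarse 8x8 saliency map (random values, not real Grad-CAM++ output).
cam = np.random.default_rng(0).random((8, 8))
h_v = resize_norm(cam)
print(h_v.shape)  # (64, 64)
```

Under these assumptions the deconvolution stack alone already reaches $64 \times 64$ for the visual branch, so ResizeNorm mainly serves to put the saliency map on a common $[0, 1]$ scale before fusion.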

Acoustic explanation branch:

$S \leftarrow MelSpec (x)$ //Convert audio to Mel-spectrogram

$T \leftarrow Tok2Grid (S)$ //Reshape to $8 \times 4 \times 512$ grid

$A_{a} \leftarrow {Conv}_{1 \times 1} (T; 512 \to 128)$ //Channel reduction

$H_{a}^{\uparrow} \leftarrow {Deconv}_{k = 4, s = 2}^{\times 3} (A_{a}; 128 \to 64 \to 32 \to 1)$ //Upsampling $8 \times 4 \to 64 \times 64$

$H_{a} \leftarrow IntegratedGradients (S, c)$ //Generate time–frequency saliency map

$H_{a} \leftarrow ResizeNorm (H_{a}, 64 \times 64)$
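The IntegratedGradients step can be illustrated on a toy problem. The sketch below is not the paper's acoustic encoder: it assumes an all-zero baseline, a midpoint Riemann sum along the straight-line path, and a hypothetical quadratic class score $f(S) = \sum W \odot S^{2}$ whose gradient is known in closed form, so that the completeness property (attributions sum to $f(S) - f(\text{baseline})$) can be verified directly.

```python
import numpy as np

def integrated_gradients(x, grad_fn, baseline=None, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    x = np.asarray(x, dtype=np.float64)
    if baseline is None:
        baseline = np.zeros_like(x)  # assumed all-zero (silent) baseline
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x)
    for a in alphas:
        # Gradient of the score at a point on the baseline-to-input path.
        avg_grad += grad_fn(baseline + a * (x - baseline))
    avg_grad /= steps
    return (x - baseline) * avg_grad

# Hypothetical toy score and its analytic gradient (stand-ins for a real model).
rng = np.random.default_rng(0)
S = rng.random((8, 4))  # tiny stand-in for a Mel-spectrogram
W = rng.random((8, 4))

def score(s):
    return float(np.sum(W * s ** 2))

def grad(s):
    return 2.0 * W * s

H_a = integrated_gradients(S, grad)

# Completeness: attributions sum to score(S) - score(baseline).
print(np.isclose(H_a.sum(), score(S) - score(np.zeros_like(S))))
```

With a real network, `grad_fn` would be supplied by automatic differentiation, and the resulting map would then go through ResizeNorm as in the pseudocode.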

Gated fusion:

$U \leftarrow [H_{v}, H_{a}]$ //Concatenate along channel dimension ( $64 \times 64 \times 2$ )

$G \leftarrow \sigma ({Conv}_{1 \times 1} (GELU ({Conv}_{1 \times 1} (U; 2 \to 16)); 16 \to 1))$ //Pixel-wise gating mask $G \in [0, 1]$

$H^{c} \leftarrow G \odot H_{v} + (1 - G) \odot H_{a}$ //Cross-modal fused explanation map
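A minimal NumPy sketch of the gated fusion step follows. It relies on the fact that a $1 \times 1$ convolution is a per-pixel linear map over channels, so it can be written as a matrix product on the last axis. The weights are random placeholders rather than trained parameters, GELU uses the common tanh approximation, and the function name `gated_fusion` is hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_v, h_a, w1, b1, w2, b2):
    """Fuse two 64x64 saliency maps with a pixel-wise gate G in [0, 1]."""
    u = np.stack([h_v, h_a], axis=-1)      # (64, 64, 2): channel concatenation
    hidden = gelu(u @ w1 + b1)             # 1x1 conv, 2 -> 16 channels
    g = sigmoid(hidden @ w2 + b2)[..., 0]  # 1x1 conv, 16 -> 1 channel, then sigmoid
    h_c = g * h_v + (1.0 - g) * h_a        # per-pixel convex combination
    return h_c, g

rng = np.random.default_rng(0)
h_v, h_a = rng.random((64, 64)), rng.random((64, 64))  # stand-in saliency maps
w1, b1 = rng.normal(size=(2, 16)), np.zeros(16)        # placeholder weights
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

h_c, g = gated_fusion(h_v, h_a, w1, b1, w2, b2)
print(h_c.shape, float(g.min()) >= 0.0, float(g.max()) <= 1.0)
```

Because $G$ lies in $[0, 1]$, every pixel of $H^{c}$ falls between the corresponding visual and acoustic saliency values, which keeps the fused map on the same scale as its inputs.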

Training-phase losses:

${\tilde{H}}^{c} \leftarrow Softmax (vec (H^{c}))$, ${\tilde{M}}^{c} \leftarrow Softmax (vec (M^{c}))$ //Distribution normalization

$L_{align} \leftarrow D_{KL} ({\tilde{H}}^{c} \,\|\, {\tilde{M}}^{c})$ //Attention–explanation alignment loss

$L_{tv} \leftarrow {\| \nabla H^{c} \|}_{1}$ //Spatial smoothness regularization

$F \leftarrow BackboneFeat (\cdot)$, $R_{gate} \leftarrow {\| F - \Pi_{c} \odot F \|}_{2}^{2}$ //Class-wise gating regularization

$L_{xai} \leftarrow \mu_{1} L_{tv} + \mu_{2} L_{align} + \mu_{3} R_{gate}$ //Total explanation loss

return $H^{c}, L_{xai}$
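The training-phase losses can be sketched in NumPy as below. Several details are assumptions rather than statements from the pseudocode: $M^{c}$ is taken to have the same number of elements as $H^{c}$, $\nabla H^{c}$ is approximated by forward differences (anisotropic total variation), $\Pi_{c}$ is treated as a binary channel mask over a 512-dimensional feature vector, and the weights $\mu_{1}, \mu_{2}, \mu_{3}$ default to 1 as placeholders only.

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())  # subtract the max for numerical stability
    return z / z.sum()

def kl_div(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def tv_l1(h):
    """L1 norm of forward differences along both spatial axes."""
    return float(np.abs(np.diff(h, axis=0)).sum() + np.abs(np.diff(h, axis=1)).sum())

def gate_reg(f, mask):
    """Squared L2 norm of the feature part suppressed by the class mask."""
    return float(np.sum((f - mask * f) ** 2))

def xai_loss(h_c, m_c, f, mask, mu=(1.0, 1.0, 1.0)):
    l_align = kl_div(softmax(h_c.ravel()), softmax(m_c.ravel()))
    l_tv = tv_l1(h_c)
    r_gate = gate_reg(f, mask)
    return mu[0] * l_tv + mu[1] * l_align + mu[2] * r_gate

rng = np.random.default_rng(0)
h_c = rng.random((64, 64))                       # stand-in fused heatmap
m_c = rng.random((64, 64))                       # stand-in attention marginal
f = rng.normal(size=512)                         # stand-in backbone feature
mask = (rng.random(512) > 0.5).astype(np.float64)  # hypothetical binary class mask

loss = xai_loss(h_c, m_c, f, mask)
print(loss > 0.0)

# Sanity check: a constant map matching the attention marginal, with a
# pass-through mask, incurs zero loss.
flat = np.full((64, 64), 0.5)
print(xai_loss(flat, flat, f, np.ones(512)))  # 0.0
```

The sanity check reflects the intent of each term: $L_{tv}$ vanishes for a perfectly smooth map, $L_{align}$ vanishes when explanation and attention distributions coincide, and $R_{gate}$ vanishes when the mask leaves the features unchanged.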