HASCom: A Heterogeneous Affective-Semantic Communication Framework for Speech Transmission

2026 Mar 31;26(7):2158. doi: 10.3390/s26072158

Algorithm 2 Stage II: Prior-Guided Affective Diffusion Optimization

Input: Frozen modules: JSCC encoder/decoder $θ_{jscc}$, semantic encoder $E_{c}$;

Trainable modules: structural prior network $P_{prior}$ (including the duration predictor), diffusion U-Net $ϵ_{θ}$;

Data: speech dataset $D$, ground-truth Mel-spectrogram $X_{0}$;

Hyperparameters: loss weights $λ_{a}, λ_{s}$, noise schedule $β_{t}$

Output: Optimized generative parameters $θ_{gen} = \{P_{prior}, ϵ_{θ}\}$

1: Initialize $θ_{gen}$ randomly.
2: repeat
3: for each batch $(W, X_{0}) \sim D$ do
4: Digital stream: extract and quantize semantic features:
5: $\hat{Z}_{c} \leftarrow \mathrm{Quantize}(E_{c}(W))$
6: Analog stream: transmit affective features via the frozen JSCC:
7: $Z_{a} = \mathrm{Agg}(E_{a}(W))$
8: $\hat{Z}_{a} = D_{jscc}(\mathrm{Channel}(E_{jscc}(Z_{a})))$
9: Predict unaligned distributions: $\tilde{μ} = P_{prior}(\hat{Z}_{c})$
10: Monotonic Alignment Search (MAS):
11: $A^{*} = \mathrm{MAS}(\tilde{μ}, X_{0})$
12: Get the aligned semantic skeleton and durations:
13: $μ = \mathrm{Align}(\tilde{μ}, A^{*}), \quad d_{mas} = \mathrm{Sum}(A^{*})$
14: Compute structural losses, with $d_{pred}$ the output of the duration predictor in $P_{prior}$:
15: $L_{s} = \| μ - X_{0} \|^{2}, \quad L_{t} = \| \log d_{pred} - \log d_{mas} \|^{2}$
16: Sample time $t \sim [0, T]$ and noise $ξ \sim \mathcal{N}(0, I)$
17: Sample the noisy state $X_{t}$ via the OU forward process anchored at $μ$:
18: $X_{t} = ρ_{t} X_{0} + (1 - ρ_{t}) μ + δ_{t} ξ$
19: Predict the noise residual with affective FiLM conditioning:
20: $\hat{ξ} = ϵ_{θ}(X_{t}, μ, t, \hat{Z}_{a})$
21: Compute the diffusion loss:
22: $L_{d} = \| \hat{ξ} - ξ \|^{2}$
23: Aggregate the total loss:
24: $L_{total} = L_{d} + λ_{a} L_{a} + λ_{s} (L_{s} + L_{t})$
25: Update parameters: $θ_{gen} \leftarrow θ_{gen} - η \nabla_{θ_{gen}} L_{total}$
26: end for
27: until convergence
28: return $θ_{gen}$
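
Steps 10 to 13 of Algorithm 2 treat Monotonic Alignment Search as a black box. The following dependency-free Python sketch illustrates the usual dynamic-programming formulation of MAS. It assumes that each frame of $X_{0}$ is scored against a token's predicted mean by a unit-variance Gaussian log-likelihood; this scoring rule, the function names (`frame_scores`, `monotonic_alignment_search`, `durations_from_path`, `duration_loss`) and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math


def frame_scores(mu_tilde, x0):
    """score[i][j] = -0.5 * ||x0[j] - mu_tilde[i]||^2.

    Assumed unit-variance Gaussian log-likelihood (additive constant dropped).
    mu_tilde: one predicted mean vector per token; x0: one mel vector per frame.
    """
    return [[-0.5 * sum((x - m) ** 2 for x, m in zip(frame, mean))
             for frame in x0]
            for mean in mu_tilde]


def monotonic_alignment_search(score):
    """Return path[j] = token index assigned to frame j.

    The path starts at token 0, ends at the last token, never moves
    backwards and never skips a token.
    """
    n_tokens, n_frames = len(score), len(score[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    q = [[-math.inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = score[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]                                # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else -math.inf  # move on from previous token
            q[i][j] = max(stay, advance) + score[i][j]
    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path


def durations_from_path(path, n_tokens):
    """d_mas[i] = number of frames aligned to token i (the Sum(A*) of step 13)."""
    return [path.count(i) for i in range(n_tokens)]


def duration_loss(d_pred, d_mas):
    """L_t of step 15: squared error between log-durations."""
    return sum((math.log(p) - math.log(m)) ** 2 for p, m in zip(d_pred, d_mas))


# Toy example: 3 tokens, 5 one-dimensional "mel" frames.
mu_tilde = [[0.0], [1.0], [5.0]]
x0 = [[0.1], [0.0], [0.9], [5.2], [4.8]]
path = monotonic_alignment_search(frame_scores(mu_tilde, x0))
d_mas = durations_from_path(path, len(mu_tilde))
mu_aligned = [mu_tilde[i] for i in path]  # Align(mu_tilde, A*): one mean per frame
print(path, d_mas)  # [0, 0, 1, 2, 2] [2, 1, 2]
```

The aligned skeleton `mu_aligned` has one row per frame, which is what allows $L_{s}$ to compare it directly with $X_{0}$.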
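
Steps 16 to 22 of Algorithm 2 sample $X_{t}$ in closed form and regress the injected noise. The listing does not define $ρ_{t}$ or $δ_{t}$, so the sketch below makes explicit assumptions: a linear schedule $β_{s} = β_{min} + (β_{max} - β_{min}) s$ on $[0, 1]$, $ρ_{t} = \exp(-\tfrac{1}{2}\int_{0}^{t} β_{s}\,ds)$ and $δ_{t} = \sqrt{1 - \exp(-\int_{0}^{t} β_{s}\,ds)}$, as used in Grad-TTS-style mean-anchored diffusion. The default values of `beta_min` and `beta_max`, the choice $T = 1$ and all function names are placeholders rather than the paper's settings.

```python
import math
import random


def cumulative_beta(t, beta_min=0.05, beta_max=20.0):
    """Integral of the assumed linear schedule beta_s over [0, t]."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t * t


def forward_coefficients(t, beta_min=0.05, beta_max=20.0):
    """Return (rho_t, delta_t) under the assumptions stated in the lead-in."""
    b = cumulative_beta(t, beta_min, beta_max)
    rho = math.exp(-0.5 * b)
    delta = math.sqrt(1.0 - math.exp(-b))
    return rho, delta


def sample_noisy_state(x0, mu, t, rng):
    """Step 18: X_t = rho_t * X_0 + (1 - rho_t) * mu + delta_t * xi."""
    rho, delta = forward_coefficients(t)
    xi = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    x_t = [[rho * x + (1.0 - rho) * m + delta * n
            for x, m, n in zip(row_x, row_m, row_n)]
           for row_x, row_m, row_n in zip(x0, mu, xi)]
    return x_t, xi


def diffusion_loss(xi_hat, xi):
    """Step 22: L_d = ||xi_hat - xi||^2, summed over all entries."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(xi_hat, xi)
               for a, b in zip(row_a, row_b))


rng = random.Random(0)
x0 = [[0.2, -0.4], [1.0, 0.5]]   # toy ground-truth mel frames
mu = [[0.0, 0.0], [1.0, 1.0]]    # toy aligned skeleton
t = rng.uniform(0.0, 1.0)        # step 16, with T = 1 assumed
x_t, xi = sample_noisy_state(x0, mu, t, rng)

# A perfect denoiser would return xi itself; an untrained one might return zeros.
zeros = [[0.0 for _ in row] for row in xi]
print(diffusion_loss(xi, xi), diffusion_loss(zeros, xi) > 0.0)
```

Under these assumptions $ρ_{t}^{2} + δ_{t}^{2} = 1$, so $X_{t}$ equals $X_{0}$ at $t = 0$ and approaches a unit-variance Gaussian centred on $μ$ as $t$ grows, which is the sense in which the process is anchored at $μ$.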
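
Step 19 of Algorithm 2 conditions the U-Net on $\hat{Z}_{a}$ through FiLM (feature-wise linear modulation), which scales and shifts each feature channel using parameters predicted from the conditioning vector. The sketch below shows that operation in isolation. The single linear projections, the layout of the feature map and every name in the code are hypothetical; the listing does not specify where or how the modulation is applied inside $ϵ_{θ}$.

```python
def linear(weights, bias, z):
    """Plain affine map: out[c] = sum_k weights[c][k] * z[k] + bias[c]."""
    return [sum(w * v for w, v in zip(row, z)) + b
            for row, b in zip(weights, bias)]


def film(features, z_a, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM to a [channels][frames] feature map.

    gamma and beta are predicted from the affective vector z_a, then each
    channel c is modulated as gamma[c] * h + beta[c].
    """
    gamma = linear(w_gamma, b_gamma, z_a)
    beta = linear(w_beta, b_beta, z_a)
    return [[g * h + b for h in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy example: 2 channels, 2 frames, 2-dimensional affective vector.
features = [[1.0, 2.0], [3.0, 4.0]]
z_a_hat = [1.0, 2.0]
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]   # gamma = [1.0, 2.0]
w_beta, b_beta = [[0.0, 0.0], [1.0, 1.0]], [0.5, 0.0]     # beta  = [0.5, 3.0]
modulated = film(features, z_a_hat, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [[1.5, 2.5], [9.0, 11.0]]
```

Because the modulation is applied per channel and broadcast over frames, the affective vector changes the overall character of the features without altering their temporal alignment, which stays governed by $μ$.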