. 2026 Mar 31;26(7):2158. doi: 10.3390/s26072158
Algorithm 2 Stage II: Prior-Guided Affective Diffusion Optimization
Input: Frozen Modules: JSCC Encoder/Decoder $\theta_{jscc}$, Semantic Encoder $E_c$;
     Trainable Modules: Structural Prior Network $P_{prior}$ (including Duration Predictor),
     Diffusion U-Net $\epsilon_\theta$;
     Data: Speech dataset $\mathcal{D}$, Ground-truth Mel-spectrogram $X_0$;
     Hyperparameters: Loss weights $\lambda_a, \lambda_s$, noise schedule $\beta_t$.
Output: Optimized Generative Parameters $\theta_{gen} = \{P_{prior}, \epsilon_\theta\}$.
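 1: Initialize $\theta_{gen}$ randomly.
 2: repeat
 3:    for each batch $(W, X_0) \in \mathcal{D}$ do
 4:       Digital Stream: Extract & Quantize semantic features:
 5:          $\hat{Z}_c \leftarrow \mathrm{Quantize}(E_c(W))$
 6:       Analog Stream: Transmit affective features via frozen JSCC:
 7:          $Z_a = \mathrm{Agg}(E_a(W))$
 8:          $\hat{Z}_a = D_{jscc}(\mathrm{Channel}(E_{jscc}(Z_a)))$
 9:       Predict unaligned distributions: $\tilde{\mu} = P_{prior}(\hat{Z}_c)$
10:       Monotonic Alignment Search (MAS):
11:          $A^* = \mathrm{MAS}(\tilde{\mu}, X_0)$
12:       Get aligned semantic skeleton & durations:
13:          $\mu = \mathrm{Align}(\tilde{\mu}, A^*)$, $d_{mas} = \mathrm{Sum}(A^*)$
14:       Compute Structural Losses:
15:          $\mathcal{L}_s = \|\mu - X_0\|^2$, $\mathcal{L}_t = \|\log d_{pred} - \log d_{mas}\|^2$
16:       Sample time $t \in [0, T]$ and noise $\xi \sim \mathcal{N}(0, I)$
17:       Sample noisy state $X_t$ via OU forward process anchored at $\mu$:
18:          $X_t = \rho_t X_0 + (1 - \rho_t)\mu + \delta_t \xi$
19:       Predict noise residual with Affective FiLM conditioning:
20:          $\hat{\xi} = \epsilon_\theta(X_t, \mu, t, \hat{Z}_a)$
21:       Compute Diffusion Loss:
22:          $\mathcal{L}_d = \|\hat{\xi} - \xi\|^2$
23:       Aggregate Total Loss:
24:          $\mathcal{L}_{total} = \mathcal{L}_d + \lambda_a \mathcal{L}_a + \lambda_s(\mathcal{L}_s + \mathcal{L}_t)$
25:       Update gradients: $\theta_{gen} \leftarrow \theta_{gen} - \eta \nabla_\theta \mathcal{L}_{total}$
26:    end for
27: until convergence
28: return $\theta_{gen}$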
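The forward sampling and loss steps of Algorithm 2 (steps 16 to 24) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The algorithm names ρt, δt and βt without defining them, so the closed forms in `ou_coefficients` are an assumption borrowed from the common Grad-TTS-style linear schedule with T = 1, and the default values of `beta_0` and `beta_1` are illustrative. All function names are hypothetical, the norms are taken as squared L2 norms, and mel-spectrograms are treated as flat lists of floats for brevity.

```python
import math
import random


def ou_coefficients(t, beta_0=0.05, beta_1=20.0):
    """Return (rho_t, delta_t) for an assumed linear schedule on t in [0, 1].

    Assumption: beta_t = beta_0 + (beta_1 - beta_0) * t, with
    rho_t = exp(-0.5 * integral of beta) and delta_t = sqrt(1 - exp(-integral)).
    """
    integral = beta_0 * t + 0.5 * (beta_1 - beta_0) * t * t
    rho_t = math.exp(-0.5 * integral)
    delta_t = math.sqrt(1.0 - math.exp(-integral))
    return rho_t, delta_t


def forward_sample(x0, mu, t, rng):
    """Step 18: X_t = rho_t * X_0 + (1 - rho_t) * mu + delta_t * xi."""
    rho_t, delta_t = ou_coefficients(t)
    xi = [rng.gauss(0.0, 1.0) for _ in x0]  # xi ~ N(0, I), step 16
    xt = [rho_t * a + (1.0 - rho_t) * m + delta_t * n
          for a, m, n in zip(x0, mu, xi)]
    return xt, xi


def squared_l2(u, v):
    """Squared L2 distance, used here for L_s, L_t and L_d."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def total_loss(l_d, l_a, l_s, l_t, lambda_a, lambda_s):
    """Step 24: L_total = L_d + lambda_a * L_a + lambda_s * (L_s + L_t)."""
    return l_d + lambda_a * l_a + lambda_s * (l_s + l_t)


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [0.2, -1.0, 0.5, 0.0]   # toy ground-truth mel values
    mu = [0.0, -0.5, 0.4, 0.1]   # toy aligned prior mean
    xt, xi = forward_sample(x0, mu, t=0.5, rng=rng)
    xi_hat = [0.0] * len(xi)     # placeholder for the U-Net prediction
    l_d = squared_l2(xi_hat, xi)
    l_s = squared_l2(mu, x0)
    print(total_loss(l_d, 0.0, l_s, 0.0, lambda_a=1.0, lambda_s=1.0))
```

Under this assumed schedule, Xt equals X0 at t = 0 and drifts toward μ with near-unit noise as t approaches 1, which is what "anchored at μ" suggests: the reverse process starts from noise centred on the structural prior rather than on zero.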