2025 Mar 30;25(7):2197. doi: 10.3390/s25072197
Algorithm 4 Dynamic Sampling with D4PG
Input: Environment, replay buffer $D$, Q-network parameters $\theta_{\text{online}}$ and $\theta_{\text{target}}$, exploration rate $\epsilon$, learning rate $\alpha$, discount factor $\gamma$, target network update frequency $C$, target network update rate $\tau$
Output: Optimal action sequence for task offloading and resource allocation
 1: Initialize $\theta_{\text{online}}$ and $\theta_{\text{target}}$
 2: for each global iteration $g$ do
 3:   Observe current state $s_g$
 4:   Select action $a_g$:
 5:   if random(0,1) $< \epsilon$ then
 6:     Choose a random subset of UEs as $a_g$
 7:   else
 8:     Choose $a_g = \arg\max_a Q(s_g, a; \theta_{\text{online}})$
 9:   end if
10:   Sample subset $K_{\text{sampled}}$ from $a_g$
11:   Record $(s_g, a_g)$ in $D$
12:   Conduct local training and obtain reward $R$
13:   Sample mini-batch $(s, a, r, s')$ from $D$
14:   Set target:
15:   if $s'$ is terminal then
16:     $\text{target} = r$
17:   else
18:     $\text{target} = r + \gamma \, \mathbb{E}[Q(s', a'; \theta_{\text{target}})]$
19:   end if
20:   Update $\theta_{\text{online}}$ by minimizing the loss:
      $L = \left(Q(s, a; \theta_{\text{online}}) - \text{target}\right)^2$
21:   Every $C$ steps, update $\theta_{\text{target}}$ by:
      $\theta_{\text{target}} \leftarrow \tau \theta_{\text{online}} + (1 - \tau) \theta_{\text{target}}$
22: end for
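
The update rules in Algorithm 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names (`select_action`, `td_target`, `squared_loss`, `soft_update`) are hypothetical. The sketch assumes that the action space is a small, explicitly enumerated list of candidate UE subsets, that network parameters are a flat list of floats, and that a single scalar estimate stands in for the expectation over the critic's value distribution in step 18.

```python
import random


def select_action(candidates, q_fn, epsilon, rng=random):
    """Steps 5 to 9: epsilon-greedy choice among candidate UE subsets.

    `candidates` is an assumed, explicitly enumerated list of subsets;
    `q_fn` is a hypothetical callable returning Q(s_g, a) for one subset.
    """
    if rng.random() < epsilon:
        return rng.choice(candidates)      # explore: random subset of UEs
    return max(candidates, key=q_fn)       # exploit: argmax_a Q(s_g, a)


def td_target(reward, next_q, gamma, terminal):
    """Steps 15 to 19: r if s' is terminal, else r + gamma * E[Q(s', a')].

    `next_q` is assumed to be a scalar estimate of the expectation
    computed with the target-network parameters.
    """
    if terminal:
        return reward
    return reward + gamma * next_q


def squared_loss(q_online, target):
    """Step 20: squared error between the online estimate and the target."""
    return (q_online - target) ** 2


def soft_update(theta_target, theta_online, tau):
    """Step 21: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * o + (1.0 - tau) * t
            for o, t in zip(theta_online, theta_target)]


if __name__ == "__main__":
    rng = random.Random(0)
    subsets = [(0,), (1,), (0, 1)]          # toy candidate UE subsets
    action = select_action(subsets, q_fn=len, epsilon=0.1, rng=rng)
    y = td_target(reward=1.0, next_q=2.0, gamma=0.5, terminal=False)
    print(action, y, squared_loss(3.0, y),
          soft_update([0.0, 2.0], [1.0, 4.0], tau=0.5))
```

With a small $\tau$, the target parameters track the online parameters slowly, which is the usual reason for pairing a soft update with a separate target network. In a real setting, the list arithmetic in `soft_update` would be applied per tensor of the actual network.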