Sensors. 2023 Jul 17;23(14):6448. doi: 10.3390/s23146448
Algorithm 1: Multi-Pass DQN (MP-DQN) Algorithm.
Input: Probability distribution $\xi$, mini-batch size $B$, exploration parameter $\varepsilon$, learning rates $\{\alpha_a, \alpha_{a,p}\}$.
Initialization: actor weights $(\omega, \omega')$ and actor-parameter weights $(\theta, \theta')$
For $t = 1, 2, 3, \ldots, T$ do
   Estimate the action parameters $z_j(s(t); \theta(t))$ with the actor network
   Choose the action $a(t) = (j, z_j)$ according to the $\varepsilon$-greedy policy:
      $$a(t) = \begin{cases} \text{a random sample drawn from the distribution } \xi, & \text{with probability } \varepsilon \\ (j, z_j), \ j = \arg\max_{j \in A_d} Q(s(t), j, z_j; \omega), & \text{with probability } 1 - \varepsilon \end{cases}$$
   Execute action $a(t)$, and receive the immediate reward $r(s(t), a(t))$ and the next state $s(t+1)$
   Save the experience $(s(t), a(t), r(t), s(t+1))$ into the replay memory
   Sample a mini-batch of size $B$ at random from the replay memory
   Define the target $y(t)$ by
      $$y(t) = r(t) + \gamma \max_{j \in A_d} Q\big(s(t+1), j, z_j(s(t+1); \theta); \omega\big)$$
   Select the diagonal elements of the multi-pass Q-value matrix
      $$\begin{pmatrix} Q_{11} & \cdots & Q_{1c} \\ \vdots & \ddots & \vdots \\ Q_{c1} & \cdots & Q_{cc} \end{pmatrix}$$
   Choose the best action $j$ by taking the argmax over the diagonal elements $Q_{jj}$
   Use $(y(t), s(t), a(t))$ to estimate the gradients $\nabla_\omega L_x(\omega)$ and $\nabla_\theta L_x(\theta)$
Update the weight parameters $\omega, \omega', \theta, \theta'$
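The acting-side steps of the listing (estimating all action parameters with the actor network, selecting the discrete action $\varepsilon$-greedily, and keeping only the diagonal $Q_{jj}$ of the multi-pass Q-value matrix) can be sketched as follows. This is a minimal illustration, not the authors' implementation: PyTorch is assumed, the network shapes and the `param_slices` layout (which slice of the joint parameter vector belongs to each discrete action) are hypothetical, and the exploration distribution $\xi$ is taken to be uniform for simplicity.

```python
# Minimal sketch of the MP-DQN forward pass and epsilon-greedy action selection.
# Assumptions (not from the paper): PyTorch, a uniform exploration distribution xi,
# and hypothetical shapes; param_slices[j] is the slice of the joint parameter
# vector z that belongs to discrete action j.
import torch
import torch.nn as nn

class ActorParamNet(nn.Module):
    """Actor network (weights theta): maps a state to all action parameters z_1..z_c."""
    def __init__(self, state_dim, total_param_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, total_param_dim), nn.Tanh(),  # parameters assumed scaled to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class QNet(nn.Module):
    """Q-network (weights omega): Q-values for all discrete actions, given the state
    concatenated with the joint action-parameter vector."""
    def __init__(self, state_dim, total_param_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + total_param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

def multipass_q(q_net, s, z, param_slices):
    """Multi-pass trick: run c forward passes, the j-th with only z_j un-masked,
    arrange the outputs as a (c x c) matrix per sample and keep the diagonal Q_jj."""
    B, c = s.shape[0], len(param_slices)
    s_rep = s.repeat_interleave(c, dim=0)                        # (B*c, state_dim)
    z_masked = torch.zeros(B * c, z.shape[-1], device=z.device)
    for j, sl in enumerate(param_slices):                        # pass j keeps only z_j
        z_masked[j::c, sl] = z[:, sl]
    q_all = q_net(s_rep, z_masked).view(B, c, c)                 # (B, pass, action)
    return q_all.diagonal(dim1=1, dim2=2)                        # (B, c): Q_jj per action

def epsilon_greedy_action(q_net, actor, s, param_slices, eps):
    """Choose a(t) = (j, z_j) for a single state s of shape (1, state_dim):
    a random discrete action with probability eps, the greedy one otherwise."""
    with torch.no_grad():
        z = actor(s)                                             # z_j(s; theta) for all j
        if torch.rand(1).item() < eps:
            j = torch.randint(len(param_slices), (1,)).item()    # xi assumed uniform here
        else:
            j = multipass_q(q_net, s, z, param_slices).argmax(dim=-1).item()
        return j, z[0, param_slices[j]]
```

Masking the other actions' parameters to zero in each pass is what makes the diagonal entries $Q_{jj}$ depend only on their own parameters $z_j$, which is the point of the multi-pass variant over the single-pass P-DQN input.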
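A matching sketch of the learning step (the target $y(t)$, the TD gradient for $\omega$, and the actor-parameter gradient for $\theta$), continuing from the helpers above. The separate target copies, the replay layout, and the exact loss forms are assumptions made for illustration; the listing itself only states that the gradients $\nabla_\omega L_x(\omega)$ and $\nabla_\theta L_x(\theta)$ are estimated from $(y(t), s(t), a(t))$.

```python
# Sketch of one MP-DQN learning update on a sampled mini-batch, reusing multipass_q
# from the previous sketch. The target copies (q_target, actor_target) and the loss
# forms below are assumptions, not taken from the paper's listing.
import torch
import torch.nn.functional as F

def mp_dqn_update(q_net, actor, q_target, actor_target, batch, param_slices,
                  q_opt, actor_opt, gamma=0.99):
    # batch: (s, j, z, r, s_next, done), where j is a long tensor of chosen discrete
    # actions and z is the full joint parameter vector stored at acting time
    # (a simplifying assumption for this sketch); done is a float 0/1 flag.
    s, j, z, r, s_next, done = batch

    # Target: y(t) = r(t) + gamma * max_j Q(s(t+1), j, z_j(s(t+1)); target weights)
    with torch.no_grad():
        z_next = actor_target(s_next)
        q_next = multipass_q(q_target, s_next, z_next, param_slices).max(dim=-1).values
        y = r + gamma * (1.0 - done) * q_next

    # Gradient w.r.t. omega: squared TD error on the executed discrete action j
    q_taken = multipass_q(q_net, s, z, param_slices).gather(1, j.unsqueeze(1)).squeeze(1)
    q_loss = F.mse_loss(q_taken, y)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Gradient w.r.t. theta: move the action parameters toward higher Q-values
    # (the usual P-DQN-style actor-parameter objective, assumed here)
    actor_loss = -multipass_q(q_net, s, actor(s), param_slices).sum(dim=-1).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

If target copies are used as assumed here, the final step of the listing (updating $\omega, \omega', \theta, \theta'$) would also include a periodic or soft update of the target weights toward the online ones.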