Algorithm 2 The single-layer DDPG-based MRA training algorithm.
1: (Input) Batch size, actor learning rate, critic learning rate, decay rate d, discount factor, and soft update parameter;
2: (Output) The learned actor/critic networks that decide the action for (7);
3: Initialize the actor, the critic, the action, and the replay buffer D, and set the initial decay rate;
4: for episode = 1 to the maximum number of episodes do
5:   Initialize the state and the exploration noise;
6:   for each time step of the episode do
7:     Normalize the state with (32);
8:     Execute the action in (30), obtain the reward with (23), and observe the new state;
9:     if replay buffer D is not full then
10:      Store the transition in D;
11:    else
12:      Replace the oldest transition in D with the new one;
13:      Update the index of the oldest transition;
14:      Randomly sample a mini-batch of stored transitions from D;
15:      Update the critic online network by minimizing the loss function in (36);
16:      Update the actor online network with the gradient obtained by (37);
17:      Soft-update the target networks with their parameters updated by (29);
18:      Decay the exploration noise by the decay rate d;
19:    end if
20:  end for
21: end for
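The training loop above can be sketched in code. The sketch below is an illustration only, not the paper's implementation: the MRA environment, the linear actor mu(s) = theta*s and critic Q(s, a) = w . [s, a, 1], and all hyperparameter values are placeholder assumptions standing in for the networks and for equations (23), (29), (30), (36), and (37). It does, however, follow the listing's control flow: a FIFO replay buffer that overwrites the oldest transition once full (steps 9-13), mini-batch sampling and critic/actor updates only after the buffer fills (steps 14-16), soft target updates (step 17), and noise decay by the rate d (step 18).

```python
import random
import numpy as np

class ReplayBuffer:
    """FIFO buffer: once full, the oldest transition is overwritten (steps 9-13)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.oldest = 0  # index of the oldest stored transition

    def store(self, transition):
        if len(self.data) < self.capacity:       # step 10: buffer not yet full
            self.data.append(transition)
        else:                                    # step 12: replace the oldest one
            self.data[self.oldest] = transition
            self.oldest = (self.oldest + 1) % self.capacity  # step 13

    def full(self):
        return len(self.data) == self.capacity

    def sample(self, batch_size):                # step 14: random mini-batch
        return random.sample(self.data, batch_size)


def soft_update(target, online, tau):
    """Step 17: target <- tau * online + (1 - tau) * target."""
    return tau * online + (1.0 - tau) * target


def step_env(s, a):
    """Toy 1-D stand-in for the MRA environment (assumption):
    next state s' = clip(s + a), reward = -(s + a)^2, so a = -s is optimal."""
    reward = -(s + a) ** 2
    return reward, float(np.clip(s + a, -1.0, 1.0))


# Linear actor mu(s) = theta * s and critic Q(s, a) = w . [s, a, 1] (assumptions)
theta, theta_targ = 0.5, 0.5
w, w_targ = np.zeros(3), np.zeros(3)

alpha_a, alpha_c = 1e-4, 1e-3    # actor / critic learning rates (placeholders)
gamma, tau, d = 0.9, 0.05, 0.99  # discount, soft-update, and decay rates
buffer, batch = ReplayBuffer(64), 16

for episode in range(30):
    s, eps = float(np.random.uniform(-1, 1)), 1.0  # step 5: state and noise
    for t in range(50):
        # Step 8: execute a noisy action, observe reward and next state
        a = float(np.clip(theta * s + eps * np.random.randn(), -1.0, 1.0))
        r, s_next = step_env(s, a)
        buffer.store((s, a, r, s_next))          # steps 9-13 inside the buffer
        if buffer.full():
            for (si, ai, ri, sni) in buffer.sample(batch):
                # Critic target y = r + gamma * Q_targ(s', mu_targ(s'))
                an = theta_targ * sni
                y = ri + gamma * (w_targ @ np.array([sni, an, 1.0]))
                phi = np.array([si, ai, 1.0])
                td = y - w @ phi
                w = w + alpha_c * td * phi       # step 15: descend the TD loss
                # Step 16: deterministic policy gradient dQ/da * dmu/dtheta
                theta = theta + alpha_a * w[1] * si
            theta_targ = soft_update(theta_targ, theta, tau)  # step 17
            w_targ = soft_update(w_targ, w, tau)
            eps *= d                             # step 18: decay the noise
        s = s_next
```

One design point worth noting: because the sampling and update steps sit in the else branch of the buffer-full test, no learning happens until the replay buffer has filled once, exactly as in the listing.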