Sensors. 2024 Apr 9;24(8):2386. doi: 10.3390/s24082386
Algorithm 2 MODDPG
Input: state space s_rl
Output: the action of the UAV, a_rl
 1: Initialize Actor network parameters θ^μ and Target Actor network parameters θ^μ′, θ^μ′ ← θ^μ.
 2: Initialize Critic network parameters θ^Q and Target Critic network parameters θ^Q′, θ^Q′ ← θ^Q.
 3: Initialize the experience replay buffer P.
 4: for episode = 0 to M do
 5:     Initialize s_rl.
 6:     for time step = t_1, t_2, …, t_n do
 7:         repeat
 8:             With probability ϵ, choose an action a_rl = clip(μ(s_rl | θ^μ) + ϵ, a_low, a_high).
 9:             Perform action a_rl and observe the reward r_rl and the next state s′_rl.
10:             Store the transition (s_rl, a_rl, r_rl, s′_rl) in P.
11:             if |P| ≥ batch size then
12:                 Randomly sample a mini-batch of transitions (s_rl, a_rl, r_rl, s′_rl) from P.
13:                 Compute the target value y_rl by Equation (19).
14:                 Update the Critic network by minimizing the critic loss, Equation (20).
15:                 Update the Actor network by maximizing the actor objective, Equation (21).
16:                 Soft-update the target network parameters:
17:                 θ^Q′ ← τ θ^Q + (1 − τ) θ^Q′
                    θ^μ′ ← τ θ^μ + (1 − τ) θ^μ′
18:             end if
19:         until ∑_{n=0}^{N} t_n ≥ T
20:     end for
21: end for
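The mechanical parts of the listing above — the clipped exploratory action (step 8), storing transitions in the replay buffer P and training only once a full batch is available (steps 10–12), and the soft target update (step 17) — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear "Actor", the state/action dimensions, the placeholder environment dynamics and reward, and all constants are assumptions, and the critic/actor updates of Equations (19)–(21) are left as a comment.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the paper does not specify these.
STATE_DIM, ACTION_DIM = 4, 2
A_LOW, A_HIGH = -1.0, 1.0          # action bounds a_low, a_high (step 8)
TAU = 0.005                         # soft-update rate τ (step 17)
BATCH_SIZE = 8

# Stand-in linear "Actor": a = W s. The real Actor/Critic are neural networks.
theta_mu = rng.normal(size=(ACTION_DIM, STATE_DIM))
theta_mu_target = theta_mu.copy()   # step 1: θ^μ′ ← θ^μ

replay_buffer = deque(maxlen=10_000)  # step 3: experience replay buffer P

def select_action(state, noise_scale=0.1):
    """Step 8: exploratory action clipped to [a_low, a_high]."""
    a = theta_mu @ state + rng.normal(scale=noise_scale, size=ACTION_DIM)
    return np.clip(a, A_LOW, A_HIGH)

def soft_update(target, online, tau=TAU):
    """Step 17: θ′ ← τ θ + (1 − τ) θ′."""
    return tau * online + (1.0 - tau) * target

# A few interaction steps against placeholder dynamics.
state = rng.normal(size=STATE_DIM)
for _ in range(BATCH_SIZE):
    action = select_action(state)
    next_state = rng.normal(size=STATE_DIM)      # placeholder transition
    reward = -float(np.sum(action ** 2))         # placeholder reward
    replay_buffer.append((state, action, reward, next_state))  # step 10
    state = next_state

# Steps 11-12: train only once |P| reaches the batch size.
if len(replay_buffer) >= BATCH_SIZE:
    idx = rng.choice(len(replay_buffer), size=BATCH_SIZE, replace=False)
    batch = [replay_buffer[i] for i in idx]
    # ... compute y_rl (Eq. 19) and the critic/actor updates (Eqs. 20-21) ...
    theta_mu_target = soft_update(theta_mu_target, theta_mu)
```

The gating in the last block mirrors steps 11–12: no gradient step is taken until the buffer holds at least one mini-batch, after which the target parameters track the online parameters at rate τ.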