Delay-Informed Intelligent Formation Control for UAV-Assisted IoT Application

2023 Jul 6;23(13):6190. doi: 10.3390/s23136190

Algorithm 1 DIDDPG-based UAV formation algorithm.

1: Initialize the system parameters $\bar{P}, F_{j}, H_{j}, D_{j}^{1}, D_{j}^{2}$ and the replay memory buffer R.
2: Randomly initialize $θ^{μ}$, $θ^{Q}$, $μ'$, and $Q'$.
3: Initialize the online actor and critic networks $μ(s | θ^{μ})$ and $Q(s, a | θ^{Q})$, respectively.
4: for episode $= 0 : 1 : N - 1$ do
5:   Initialize the random noise $ω$ and the state $s_{0}$.
6:   for $j = 0 : 1 : M - 1$ do
7:     Update the action $a_{j} = μ(s_{j} | θ^{μ}) + ω$.
8:     Update the next state $s_{j + 1}$ based on (11): $s_{j + 1} = s_{j} F_{j} + a_{j} H_{j} + [G_{j}^{T}, 0_{3 \times 1}]$.
9:     Derive the reward $r_{j}$ by (12): $r_{j} = - s_{j} \bar{P} s_{j}^{T} - a_{j + 1} Q a_{j + 1}^{T}$.
10:     Store the transition $(s_{j}, a_{j}, r_{j}, s_{j + 1})$ in R.
11:     Randomly select a mini-batch of K experience samples $(s_{j}, a_{j}, r_{j}, s_{j + 1})$ from R.
12:     Update the target Q value based on (15): $y_{j} = r_{j} + γ Q'(s_{j + 1}, μ'(s_{j + 1} | θ^{μ'}) | θ^{Q'})$.
13:     Update $θ^{Q}$ by minimizing the mean quadratic error function based on (14).
14:     Update $θ^{μ}$ by the sampled policy gradient $\nabla_{θ^{μ}} J$ given by (16).
15:     Update the target networks:
16:     $θ^{Q'} \leftarrow δ θ^{Q} + (1 - δ) θ^{Q'}$, $θ^{μ'} \leftarrow δ θ^{μ} + (1 - δ) θ^{μ'}$.
17:   end for
18: end for
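
The arithmetic inside the inner loop of Algorithm 1 can be illustrated with a small Python sketch. It is not the authors' implementation: the actor and critic networks are stubbed out, and the helper names (`vec_mat`, `next_state`, `reward`, `td_target`, `soft_update`), the argument name `Q_w` for the action weight matrix, the offset row `g` standing in for $[G_{j}^{T}, 0_{3 \times 1}]$, and all numeric values are assumptions chosen only for illustration. States and actions are treated as row vectors, matching the form $s_{j} F_{j} + a_{j} H_{j}$.

```python
import random
from collections import deque


def vec_mat(v, m):
    """Row vector times matrix (v @ m), using plain lists."""
    return [sum(v[i] * m[i][k] for i in range(len(v))) for k in range(len(m[0]))]


def next_state(s, a, F, H, g):
    """Step 8: s_{j+1} = s_j F_j + a_j H_j + g (g is the additive offset row)."""
    return [x + y + z for x, y, z in zip(vec_mat(s, F), vec_mat(a, H), g)]


def reward(s, a, P_bar, Q_w):
    """Step 9: negative quadratic cost -s P_bar s^T - a Q_w a^T.

    Which action index enters the second term is fixed by (12);
    this helper simply uses the action it is given.
    """
    state_cost = sum(x * y for x, y in zip(vec_mat(s, P_bar), s))
    action_cost = sum(x * y for x, y in zip(vec_mat(a, Q_w), a))
    return -state_cost - action_cost


def td_target(r, q_next, gamma):
    """Step 12: y_j = r_j + gamma * Q'(s_{j+1}, mu'(s_{j+1}))."""
    return r + gamma * q_next


def soft_update(target, online, delta):
    """Steps 15-16: theta' <- delta * theta + (1 - delta) * theta'."""
    return [delta * w + (1.0 - delta) * wt for w, wt in zip(online, target)]


if __name__ == "__main__":
    # Toy 2-D state and 1-D action; values are illustrative, not from the paper.
    F = [[1.0, 0.0], [0.1, 1.0]]
    H = [[0.0, 0.1]]
    g = [0.0, 0.0]
    P_bar = [[1.0, 0.0], [0.0, 1.0]]
    Q_w = [[0.1]]

    replay = deque(maxlen=1000)  # replay memory R (step 1)
    s = [1.0, 0.0]
    for j in range(5):
        a = [random.uniform(-1.0, 1.0)]  # stand-in for mu(s_j) + noise (step 7)
        s_next = next_state(s, a, F, H, g)
        r = reward(s, a, P_bar, Q_w)
        replay.append((s, a, r, s_next))  # step 10
        s = s_next

    batch = random.sample(list(replay), 3)  # step 11 with K = 3
    # The target critic Q' is stubbed as 0 here, so each target equals its reward.
    targets = [td_target(r, 0.0, 0.99) for (_, _, r, _) in batch]
    print(targets)
    print(soft_update([0.0, 0.0], [1.0, 2.0], 0.01))
```

With a small $δ$, `soft_update` moves the target parameters only slightly toward the online parameters at each step, which is the purpose of the update in steps 15 and 16. Steps 13 and 14 (the critic and actor gradient updates) are omitted because they depend on (14) and (16) and on the network architecture.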