Application of Deep Reinforcement Learning to UAV Swarming for Ground Surveillance

2023 Oct 27;23(21):8766. doi: 10.3390/s23218766

Algorithm 3 PPO

for each iteration do
    for T time steps do
        Run the previous policy $π_{θ_{old}}$ and record (state, action, reward)
    end for
    Compute the advantage estimates $\hat{A}_{1}, …, \hat{A}_{T}$ for each action taken
    Optimize the objective function L with respect to $θ$ for K epochs with minibatch size $< T$
    Update the policy parameters: $θ_{old} \leftarrow θ$
end for
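
The numerical pieces of Algorithm 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: it assumes that L is the standard clipped surrogate objective of PPO and that the advantages are obtained with generalized advantage estimation, neither of which is stated in the algorithm box. The function names and the default values of `gamma`, `lam`, and `eps` are hypothetical choices made for the example.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates A_1..A_T for one rollout of T steps.

    Assumption: generalized advantage estimation. `values` holds T + 1
    state-value estimates; the last entry bootstraps the final step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def clipped_surrogate(new_logp, old_logp, adv, eps=0.2):
    """Assumed form of L: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.exp(new_logp - old_logp)  # r = pi_theta / pi_theta_old
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))


def minibatch_indices(T, batch_size, epochs, rng):
    """Yield shuffled index batches: K epochs over the T collected steps."""
    for _ in range(epochs):
        perm = rng.permutation(T)
        for start in range(0, T, batch_size):
            yield perm[start:start + batch_size]
```

In a full training loop, each batch of indices would select the stored log-probabilities and advantages, the objective would be maximized with respect to θ by a gradient-based optimizer, and θ_old would be overwritten with θ once the K epochs finish. The gradient step itself is omitted here because it depends on the network and framework used.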