Algorithm 2 SARSA
initialize Q(s, a) arbitrarily, where s denotes the state of the agent and a denotes the action
for each episode do
    initialize s
    choose a from s using policy derived from Q
    while s is not a terminal state and step number < max step number do
        take action a, observe reward r and next state s′
        choose the next action a′ from s′ using policy derived from Q
        Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
        s ← s′; a ← a′
    end while
end for