Algorithm 2. Training the prediction neural network
Network structure:
  input layer: 2n neurons (n neurons for the agent's position and n for the target location probabilities, one per cell of the domain of size n),
  hidden layer: 2n neurons,
  output layer: 9 neurons (in accordance with the number of possible actions).
Activation function:
  sigmoid function f(x) = 1/(1 + e^(-x)).
Loss function:
  mean square error (MSE) function.
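
For concreteness, this specification translates into a small feed-forward model. The following is a minimal sketch, assuming a PyTorch implementation; the class name PredictionNetwork is illustrative, and since the algorithm does not state whether the sigmoid also applies to the 9 output neurons, the sketch applies it only to the hidden layer.

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Feed-forward Q-network: 2n inputs -> 2n sigmoid hidden units -> 9 action values."""

    def __init__(self, n: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2 * n, 2 * n),   # input layer -> hidden layer
            nn.Sigmoid(),              # sigmoid activation, as specified
            nn.Linear(2 * n, 9),       # hidden layer -> one output per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

# Mean square error, as specified for the loss function.
loss_fn = nn.MSELoss()
```
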
Input: domain C = {c_1, c_2, …, c_n},
  set A of the nine possible actions,
  probability p_TA of true alarms (Equation (3)),
  rate α of false alarms and their probability p_FA = α·p_TA (Equation (4)),
  sensor sensitivity λ,
  discount factor γ,
  objective probability map P* (obtained by using the value ε),
  number r of iterations for updating the weights,
  initial value η (Equation (22)) and its discount factor δ,
  learning rate ρ (with respect to the type of optimizer),
  number M of epochs,
  initial weights w of the prediction network and initial weights w′ = w of the target network,
  training data set (that is, the L×N table of (c, P) pairs created by Procedure 1).
Output: The trained prediction network.
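
Before the (c, P) pairs produced by Procedure 1 can be fed to the network, they must be flattened into the 2n-dimensional input described above. The encoding is not spelled out here; the helper below is a hypothetical sketch that assumes a one-hot vector over the n cells for the agent's position, concatenated with the n-cell probability map.

```python
import torch

def encode_state(agent_cell: int, prob_map: list[float]) -> torch.Tensor:
    """Assumed encoding: one-hot agent position concatenated with the probability map P."""
    n = len(prob_map)
    position = torch.zeros(n)
    position[agent_cell] = 1.0          # mark the cell currently occupied by the agent
    return torch.cat([position, torch.tensor(prob_map, dtype=torch.float32)])
```
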
1. Create the prediction network.
2. Create the target network as a copy of the prediction network.
3. For each epoch j=1,,M do:
4. For each pair (c, P) from the training data set, do:
5. For each action a ∈ A do:
6. Calculate the value Q(c, P, a; w) with the prediction network.
7. Calculate the probability p(a | Q; η) (Equation (22)).
8. End for.
9. Choose an action according to the probabilities p(a | Q; η).
10. Apply the chosen action and set the next position c′ = a(c) of the agent.
11. Calculate the next probability map P′ with Equations (20) and (21).
12. If P′ = P* or c′ ∉ C, then
13. Set the immediate reward R(a) = 0.
14. Else
15. Calculate the immediate reward R(a) with respect to P and P′ (Equation (14)).
16. End if.
17. For each action a ∈ A do:
18. If P′ = P*, then
19. Set Q(c′, P′, a; w′) = 0.
20. Else
21. Calculate the value Q(c′, P′, a; w′) with the target network.
22. End if.
23. End for.
24. Calculate the target value Q⁺ = R(a) + γ·max_{a∈A} Q(c′, P′, a; w′) (Equation (17)).
25. Calculate the temporal-difference learning error as Δ_l Q = Q⁺ − Q(c, P, a; w) for the chosen action a (Equation (19)) and set Δ_l Q = 0 for all other actions.
26. Update the weights w in the prediction network by backpropagation with respect to the error Δ_l Q.
27. Every r iterations, set the weights of the target network as w′ = w.
28. End for.
29. End for.
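
Taken together, the numbered steps above follow the familiar two-network (DQN-style) update with softmax exploration. The sketch below is an illustration of that control flow under several stated assumptions, not the authors' implementation: Equation (22) is taken to be a Boltzmann/softmax rule with temperature η (discounted by δ once per epoch), the optimizer is plain SGD with learning rate ρ, and apply_action, next_prob_map, immediate_reward and is_terminal are hypothetical placeholders for Equations (20), (21), (14) and the termination test. It reuses the PredictionNetwork and encode_state helpers sketched above.

```python
import copy
import torch

def train(pred_net, dataset, *, gamma, eta, delta, r, rho, epochs,
          apply_action, next_prob_map, immediate_reward, is_terminal):
    """DQN-style training loop mirroring Algorithm 2 (illustrative sketch)."""
    target_net = copy.deepcopy(pred_net)                # steps 1-2: target network starts as a copy
    optimizer = torch.optim.SGD(pred_net.parameters(), lr=rho)   # optimizer type is an assumption
    loss_fn = torch.nn.MSELoss()
    iteration = 0

    for epoch in range(epochs):                         # step 3: epochs
        for c, P in dataset:                            # step 4: (c, P) pairs from Procedure 1
            x = encode_state(c, P)
            q_values = pred_net(x)                      # steps 5-8: Q(c, P, a; w) for all actions
            probs = torch.softmax(q_values / eta, dim=0)    # assumed form of Equation (22)
            a = torch.multinomial(probs, 1).item()      # step 9: sample an action

            c_next = apply_action(c, a)                 # step 10: c' = a(c)
            P_next = next_prob_map(P, c_next)           # step 11: Equations (20) and (21)
            terminal = is_terminal(P_next, c_next)      # P' = P* or the agent left the domain
            reward = 0.0 if terminal else immediate_reward(P, P_next)   # steps 12-16

            # Steps 17-24: target value from the target network (both termination cases treated alike here).
            with torch.no_grad():
                q_next = torch.zeros(9) if terminal else target_net(encode_state(c_next, P_next))
            q_target = reward + gamma * q_next.max()    # Q+ = R(a) + gamma * max_a Q(c', P', a; w')

            # Steps 25-26: nonzero error only on the chosen action, then backpropagate.
            target_vec = q_values.detach().clone()
            target_vec[a] = q_target
            loss = loss_fn(q_values, target_vec)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            iteration += 1
            if iteration % r == 0:                      # step 27: sync w' = w every r iterations
                target_net.load_state_dict(pred_net.state_dict())
        eta *= delta                                    # assumed schedule: discount the temperature each epoch

    return pred_net
```
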