| Algorithm 2 Pseudocode of the learning algorithm for selecting state-action pairs, policies, and rewards (Code Listing 2: Deep Q-learning algorithm) |
| 1 Initialize P(state, action) with random values |
| 2 while (P != terminal) { |
| 3 Initialize state |
| 4 while (state != terminal) { |
| 5 Choose an action for the current state using the policy derived from P |
| 6 Take the action; observe the reward r and the next state state’ |
| 7 P(state, action) ← P(state, action) + α [r + γ max_action’ P(state’, action’) − P(state, action)] |
| 8 state ← state’ |
| 9 } |
| 10 } |
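
The steps of Algorithm 2 can be sketched as a small tabular Q-learning program. This is a minimal illustration under stated assumptions, not the implementation used in this work: the five-state chain environment, the helper names (`step`, `choose_action`, `q_learning`, `greedy_policy`), the epsilon-greedy policy, and the values of the learning rate `alpha`, discount factor `gamma`, and exploration rate `epsilon` are all hypothetical and not given in the listing. The outer loop is read here as one pass per episode, and P is stored as a plain table rather than a neural network.

```python
import random

N_STATES = 5              # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)          # 0 = move left, 1 = move right
TERMINAL = N_STATES - 1


def step(state, action):
    """Hypothetical environment: reward 1.0 on reaching the terminal state, else 0.0."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == TERMINAL else 0.0
    return reward, next_state


def choose_action(P, state, epsilon, rng):
    """Epsilon-greedy policy inferred from P (an assumed choice); ties broken at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(P[state])
    return rng.choice([a for a in ACTIONS if P[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize P(state, action) with small random values.
    # Terminal-state entries are kept at zero, the usual convention.
    P = [[rng.uniform(0.0, 0.01) for _ in ACTIONS] for _ in range(N_STATES)]
    P[TERMINAL] = [0.0 for _ in ACTIONS]
    for _ in range(episodes):                                  # step 2 (one pass per episode)
        state = 0                                              # step 3
        while state != TERMINAL:                               # step 4
            action = choose_action(P, state, epsilon, rng)     # step 5
            reward, next_state = step(state, action)           # step 6
            # Step 7: move P(state, action) toward the bootstrapped target.
            target = reward + gamma * max(P[next_state])
            P[state][action] += alpha * (target - P[state][action])
            state = next_state                                 # step 8
    return P


def greedy_policy(P):
    """Best action in each non-terminal state according to the learned table."""
    return [max(ACTIONS, key=lambda a: P[s][a]) for s in range(TERMINAL)]
```

Calling `greedy_policy(q_learning())` returns the learned action for each non-terminal state; in this toy chain the learned policy should move right in every state, since that is the only way to reach the rewarding terminal state.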