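| Algorithm 1 Learning process, off-policy |
|---|
| Require: $Q(s, a)$ initialized arbitrarily, and $Q(\text{terminal}, \cdot) = 0$ |
| for each episode do |
| Initialize agent $a$ with states $s$ at time $t$ |
| for each step of the episode do |
| Choose $A$ from $S$ using the policy derived from $Q$ |
| Take action $A$, observe $R$, $S'$ |
| end for, until $S$ is terminal, hence the PE is fully mutated |
| end for |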
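
The loop structure of Algorithm 1 can be illustrated with a minimal tabular sketch. The listing does not show a value update, so this sketch assumes the standard Q-learning update with an epsilon-greedy behaviour policy. The function name `q_learning`, the parameters `alpha`, `gamma`, and `epsilon`, and the toy chain environment inside `step` are all hypothetical stand-ins. They do not model PE mutation and are not taken from the source.

```python
import random


def q_learning(n_states=5, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states 0..n_states-1."""
    rng = random.Random(seed)
    terminal = n_states - 1

    # Q initialised arbitrarily (zeros here); the terminal row is never
    # updated, so Q(terminal, .) stays 0.
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def step(s, a):
        # Assumed toy dynamics: action 1 moves one state to the right,
        # any other action leaves the state unchanged.
        s_next = min(s + 1, terminal) if a == 1 else s
        reward = 1.0 if s_next == terminal else 0.0
        return reward, s_next

    for _ in range(episodes):            # for each episode
        s = 0                            # initialise the starting state
        while s != terminal:             # for each step, until S is terminal
            # Choose A from S using an epsilon-greedy policy derived from Q.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])

            # Take action A, observe R and the next state.
            r, s_next = step(s, a)

            # Assumed off-policy update: bootstrap from the greedy value of
            # the next state, regardless of the action actually taken next.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q


if __name__ == "__main__":
    for state, row in enumerate(q_learning()):
        print(state, [round(q, 3) for q in row])
```

The update bootstraps from `max(Q[s_next])` rather than from the action the behaviour policy picks next, which is what makes this variant off-policy. In a real setting, `step` would be replaced by whatever environment applies an action and reports the reward and next state.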