2023 Jan 20;23(3):1231. doi: 10.3390/s23031231
Algorithm 1. Learning process, off-policy
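Require: $Q(s_t, A_t)$ initialized arbitrarily for all $s_t \in S$, $A_t \in A$, and $Q(\text{terminal state}, \cdot) = 0$
    for each $s_t$ do
    Initialize agent $a$ with states $s$ at time $t+1$
        for each $s_{t+1}$ do
          Choose $A_t$ from $s_t$ using $\Lambda$ derived from $Q$
          Take action $A_t$, observe $R$, $s_{t+1}$
          $Q(s_t, A_t) \leftarrow Q(s_t, A_t) + \alpha \left[ R(s_{t+1}, A_{t+1}) + \Phi \max_{A_{t+1}} Q(s_{t+1}, A_{t+1}) - Q(s_t, A_t) \right]$
          $s_t \leftarrow s_{t+1}$
        end for
    until $s_t$ is terminal, hence the PE is fully mutated
    end for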
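
The update rule in Algorithm 1 has the shape of tabular Q-learning, so a minimal Python sketch may help make the loop concrete. It is an illustration under stated assumptions, not the authors' implementation: `alpha` and `phi` stand for $\alpha$ and $\Phi$, an epsilon-greedy rule is assumed in place of $\Lambda$, and the four-state chain environment (`chain_step`), the start state, and all parameter values are hypothetical.

```python
import random


def q_learning(n_states, n_actions, step, terminal, episodes=500,
               alpha=0.1, phi=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning sketch; `step(s, a)` returns (reward, next_state)."""
    rng = random.Random(seed)
    # Zero initialization is one arbitrary choice; it also gives Q(terminal, .) = 0.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0  # assumed start state (hypothetical)
        while s != terminal:
            # Epsilon-greedy is an assumed stand-in for the policy derived from Q.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                # Greedy action, with ties broken at random.
                a = max(range(n_actions), key=lambda i: (Q[s][i], rng.random()))
            r, s_next = step(s, a)
            # Off-policy target: reward plus discounted best value of the next state.
            target = r + phi * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


def chain_step(s, a):
    """Hypothetical chain: action 1 moves right, action 0 moves left; entering state 3 pays 1."""
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == 3 else 0.0), s_next


Q = q_learning(n_states=4, n_actions=2, step=chain_step, terminal=3)
```

In this toy setting the learned table should prefer moving right in every non-terminal state, while the terminal row is never updated and stays at zero, matching the `Require` condition.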