eLife. 2019 Nov 11;8:e47463. doi: 10.7554/eLife.47463
Algorithm 1 Q-λ (and related models):
For SARSA-λ we replace the expression max_ã Q(s_{n+1}, ã) in line 9 by Q(s_{n+1}, a_{n+1}), where a_{n+1} is the action taken in the next state s_{n+1}. For Q-0 and SARSA-0 we set λ to zero.
1:  Algorithm Parameters: learning rate α ∈ (0,1], discount factor γ ∈ [0,1], eligibility trace decay factor λ ∈ [0,1], temperature T ∈ (0,∞) of softmax policy p, bias b ∈ [0,1] for preferred action a_pref ∈ A
2:  Initialize Q(s,a) = 0 and e(s,a) = 0 for all s ∈ S, a ∈ A
        For the preferred action a_pref ∈ A set Q(s, a_pref) = b
3:  for each episode do
4:      Initialize state s_n ∈ S
5:      Initialize step n = 1
6:      while s_n is not terminal do
7:          Choose action a_n ∈ A from s_n with softmax policy p derived from Q
8:          Take action a_n, and observe r_{n+1} ∈ ℝ and s_{n+1} ∈ S
9:          RPE(n→n+1) ← r_{n+1} + γ max_ã Q(s_{n+1}, ã) − Q(s_n, a_n)
10:         e_n(s_n, a_n) ← 1
11:         for all s ∈ S, a ∈ A do
12:             Q(s,a) ← Q(s,a) + α RPE(n→n+1) e_n(s,a)
13:             e_{n+1}(s,a) ← γ λ e_n(s,a)
14:         n ← n + 1
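To make the update structure concrete, the following is a minimal tabular Python/NumPy sketch of Algorithm 1. It is an illustration rather than the authors' code: the environment interface (env.reset() and env.step(a) returning the next state, a reward, and a terminal flag), the treatment of terminal states as having value zero, the per-episode reset of the eligibility traces, and the default parameter values are all assumptions made here. Setting lam = 0 recovers one-step Q-learning (Q-0), and replacing the max in the RPE by the Q-value of the action actually taken in the next state gives SARSA-λ, as noted above.

import numpy as np

def softmax_policy(q_row, T):
    """Softmax (Boltzmann) action probabilities at temperature T (algorithm line 7)."""
    z = q_row / T
    z = z - z.max()                               # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def q_lambda(env, n_states, n_actions, episodes,
             alpha=0.1, gamma=0.95, lam=0.9, T=1.0,
             b=0.0, a_pref=None, seed=None):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))           # line 2: Q(s,a) = 0
    if a_pref is not None:
        Q[:, a_pref] = b                          # bias b for the preferred action
    for _ in range(episodes):                     # line 3: for each episode
        # traces e(s,a) = 0; reset per episode here (an implementation choice;
        # the pseudocode initializes them once at line 2)
        e = np.zeros_like(Q)
        s = env.reset()                           # line 4: initialize state
        done = False
        while not done:                           # line 6: while s is not terminal
            p = softmax_policy(Q[s], T)           # line 7: softmax policy derived from Q
            a = rng.choice(n_actions, p=p)
            s_next, r, done = env.step(a)         # line 8: observe reward and next state
            # line 9: reward prediction error; Q-λ bootstraps on max_ã Q(s', ã).
            # For SARSA-λ, replace the max by Q[s_next, a_next] with the action actually taken.
            target = 0.0 if done else Q[s_next].max()   # terminal states treated as value 0
            rpe = r + gamma * target - Q[s, a]
            e[s, a] = 1.0                         # line 10: tag the visited state-action pair
            Q += alpha * rpe * e                  # lines 11-12: update all Q values via traces
            e *= gamma * lam                      # line 13: decay all traces
            s = s_next                            # line 14: advance to the next step
    return Q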