Algorithm 1 Q-λ (and related models). For SARSA-λ we replace the expression in line 9 by $\delta \leftarrow r + \gamma\, Q(s', a') - Q(s, a)$, where $a'$ is the action taken in the next state $s'$. For Q-0 and SARSA-0 we set $\lambda$ to zero.
1: Algorithm parameters: learning rate $\alpha$, discount factor $\gamma$, eligibility-trace decay factor $\lambda$, temperature $\tau$ of the softmax policy, bias $b$ for the preferred action $a^*$.
2: Initialize $Q(s,a) = 0$ and $e(s,a) = 0$ for all $s, a$. For the preferred action $a^*$ set $Q(s, a^*) = b$.
3: for each episode do
4:   Initialize state $s$
5:   Initialize step $t = 0$
6:   while $s$ is not terminal do
7:     Choose action $a$ from $s$ with softmax policy derived from $Q$
8:     Take action $a$, and observe $r$ and $s'$
9:     $\delta \leftarrow r + \gamma \max_{a'} Q(s', a') - Q(s, a)$
10:    $e(s, a) \leftarrow e(s, a) + 1$
11:    for all $s, a$ do
12:      $Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta\, e(s, a)$
13:      $e(s, a) \leftarrow \gamma \lambda\, e(s, a)$
14:    $s \leftarrow s'$
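The update loop above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `ChainEnv` toy environment, its `reset()`/`step()` interface, and all parameter values are assumptions introduced here, and eligibility traces are not cut after exploratory actions, matching the listing.

```python
import numpy as np

def softmax(q, tau):
    """Softmax policy over a row of Q-values with temperature tau."""
    z = (q - q.max()) / tau          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def q_lambda_episode(Q, env, alpha=0.1, gamma=0.9, lam=0.9, tau=0.5, rng=None):
    """Run one episode of Q-lambda; comments give the listing's line numbers.

    `env` is a hypothetical interface (reset() -> s, and
    step(s, a) -> (r, s_next, terminal)) assumed for this sketch.
    """
    rng = rng or np.random.default_rng()
    e = np.zeros_like(Q)                                  # eligibility traces
    s = env.reset()                                       # line 4
    terminal = False
    while not terminal:                                   # line 6
        a = rng.choice(Q.shape[1], p=softmax(Q[s], tau))  # line 7
        r, s_next, terminal = env.step(s, a)              # line 8
        target = 0.0 if terminal else gamma * Q[s_next].max()
        delta = r + target - Q[s, a]                      # line 9
        e[s, a] += 1.0                                    # line 10
        Q += alpha * delta * e                            # lines 11-12
        e *= gamma * lam                                  # line 13
        s = s_next                                        # line 14
    return Q

class ChainEnv:
    """Toy 3-state chain (an assumption, for illustration only): action 1
    moves right and ends the episode with reward 1 on reaching the last
    state; action 0 stays in place with reward 0."""
    n_states, n_actions = 3, 2
    def reset(self):
        return 0
    def step(self, s, a):
        if a == 1:
            s_next = s + 1
            terminal = (s_next == self.n_states - 1)
            return (1.0 if terminal else 0.0), s_next, terminal
        return 0.0, s, False
```

For SARSA-λ, the target in line 9 would use the Q-value of the action actually sampled in `s_next` instead of the max; for Q-0 and SARSA-0, pass `lam=0`.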