Algorithm 2 SARSA
initialize Q(s, a) arbitrarily, where s denotes the state of the agent and a denotes the action
for each episode do
    initialize s
    choose a from s using policy derived from Q
    while s is not a terminal state and step number < max step number do
        take action a, observe reward r and next state s′
        choose the next action a′ from s′ using policy derived from Q
        Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
        s ← s′; a ← a′
    end while
end for