|
Algorithm 1 Q-learning |
initialize $Q(s, a)$ arbitrarily, where $s$ denotes the state of the agent and $a$ denotes the action
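for each episode do
    initialize $s$
    while $s$ is not a terminal state and the number of steps < the maximum number of steps do
        choose $a$ from $s$ using a policy derived from $Q$
        take action $a$, observe reward $r$ and next state $s'$
        $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
        $s \leftarrow s'$
    end while
end for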
|
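The loop in Algorithm 1 can be sketched in Python as tabular Q-learning. This is a minimal illustration under stated assumptions, not the implementation used here: the `ChainEnv` toy environment, the epsilon-greedy policy, and the values of `alpha` (learning rate), `gamma` (discount factor), and `epsilon` (exploration rate) are all hypothetical choices, and the table update is the standard Q-learning rule.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: states 0..n-1 on a line, goal at n-1."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.n_actions = 2  # 0 = move left, 1 = move right

    def reset(self):
        return 0  # every episode starts in the leftmost state

    def step(self, s, a):
        s_next = max(0, s - 1) if a == 0 else min(self.n_states - 1, s + 1)
        done = s_next == self.n_states - 1
        reward = 1.0 if done else 0.0
        return s_next, reward, done


def q_learning(env, episodes=500, max_steps=100,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize Q(s, a) arbitrarily; zeros are used here.
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        steps = 0
        while not done and steps < max_steps:
            # Assumed policy derived from Q: epsilon-greedy, random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(env.n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            # Take action a, observe reward r and next state s'.
            s_next, r, done = env.step(s, a)
            # A terminal next state contributes no bootstrapped value.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
            steps += 1
    return Q


Q = q_learning(ChainEnv())
# Greedy action in each non-terminal state (1 means "move right").
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy)
```

The `max_steps` cap mirrors the step limit in the `while` condition: it bounds episode length when the agent has not yet found a terminal state.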