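| Algorithm 1 Q-Learning Algorithm |
| --- |
| Input: State set $S$, action set $A$, reward function $R$ |
| Output: Q table |
| 1: Initialize Q table |
| 2: while the iteration limit has not been reached do |
| 3: Return $s$ to the initial state |
| 4: while $s$ is not terminal do |
| 5: Choose $a$ from $A$ using policy derived from $Q$ (e.g., $\epsilon$-greedy) |
| 6: Take action $a$, observe $r$, $s'$ |
| 7: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$ |
| 8: $s \leftarrow s'$ |
| 9: end while |
| 10: end while |
| 11: Return Q |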
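
The loop in Algorithm 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `q_learning` and `chain_step`, the five-state corridor environment, and the values of the learning rate `alpha`, discount factor `gamma`, and exploration rate `epsilon` are all hypothetical choices made for the example. Action selection is taken to be epsilon-greedy, and the update is the standard tabular Q-learning rule.

```python
import random


def q_learning(n_states, n_actions, step, start, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; `step(s, a)` must return (next_state, reward, done)."""
    rng = random.Random(seed)
    # Step 1: initialize the Q table with zeros.
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):              # Step 2: outer iteration loop
        s = start                          # Step 3: return to the initial state
        done = False
        while not done:                    # Step 4: run until a terminal state
            # Step 5: epsilon-greedy choice, breaking ties at random.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(q[s])
                a = rng.choice([i for i, v in enumerate(q[s]) if v == best])
            # Step 6: take the action, observe reward and next state.
            s_next, r, done = step(s, a)
            # Step 7: move Q(s, a) toward the bootstrapped target.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next                     # Step 8: advance the state
    return q                               # Step 11: return the Q table


def chain_step(s, a):
    """Hypothetical corridor: states 0..4, action 1 moves right, action 0
    moves left, and reaching state 4 ends the episode with reward 1."""
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    done = s_next == 4
    return s_next, (1.0 if done else 0.0), done


q = q_learning(n_states=5, n_actions=2, step=chain_step, start=0)
# Greedy policy read off the learned table for the non-terminal states.
policy = [max(range(2), key=lambda a: q[s][a]) for s in range(4)]
print(policy)  # expected: [1, 1, 1, 1], i.e. always move right
```

In this toy setting the reward function $R$ is folded into `chain_step`, which is one convenient way to supply it; an explicit reward table would work equally well.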