eLife. 2019 Nov 11;8:e47463. doi: 10.7554/eLife.47463
Algorithm 1 Q-λ (and related models):
For SARSA-λ we replace the expression max_ã Q(s_{n+1}, ã) in line 9 by Q(s_{n+1}, a_{n+1}), where a_{n+1} is the action taken in the next state s_{n+1}. For Q-0 and SARSA-0 we set λ to zero.
1:  Algorithm Parameters: learning rate α ∈ (0,1], discount factor γ ∈ [0,1], eligibility trace decay factor λ ∈ [0,1], temperature T ∈ (0,∞) of softmax policy p, bias b ∈ [0,1] for preferred action a_pref ∈ A
2:  Initialize Q(s,a) = 0 and e(s,a) = 0 for all s ∈ S, a ∈ A
        For the preferred action a_pref ∈ A set Q(s, a_pref) = b
3:  for each episode do
4:      Initialize state s_n ∈ S
5:      Initialize step n = 1
6:      while s_n is not terminal do
7:          Choose action a_n ∈ A from s_n with softmax policy p derived from Q
8:          Take action a_n, and observe r_{n+1} ∈ ℝ and s_{n+1} ∈ S
9:          RPE(n→n+1) ← r_{n+1} + γ max_ã Q(s_{n+1}, ã) − Q(s_n, a_n)
10:         e_n(s_n, a_n) ← 1
11:         for all s ∈ S, a ∈ A do
12:             Q(s,a) ← Q(s,a) + α RPE(n→n+1) e_n(s,a)
13:             e_{n+1}(s,a) ← γ λ e_n(s,a)
14:         n ← n + 1
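To make the update structure concrete, the following is a minimal tabular Python/NumPy sketch of Algorithm 1. It is an illustration rather than the authors' code: the environment interface (env.reset() and env.step(a) returning the next state, a reward, and a terminal flag), the treatment of terminal states as having value zero, the per-episode reset of the eligibility traces, and the default parameter values are all assumptions made here. Setting lam = 0 recovers one-step Q-learning (Q-0), and replacing the max in the RPE by the Q-value of the action actually taken in the next state gives SARSA-λ, as noted above.

import numpy as np

def softmax_policy(q_row, T):
    """Softmax (Boltzmann) action probabilities at temperature T (algorithm line 7)."""
    z = q_row / T
    z = z - z.max()                               # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def q_lambda(env, n_states, n_actions, episodes,
             alpha=0.1, gamma=0.95, lam=0.9, T=1.0,
             b=0.0, a_pref=None, seed=None):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))           # line 2: Q(s,a) = 0
    if a_pref is not None:
        Q[:, a_pref] = b                          # bias b for the preferred action
    for _ in range(episodes):                     # line 3: for each episode
        # traces e(s,a) = 0; reset per episode here (an implementation choice;
        # the pseudocode initializes them once at line 2)
        e = np.zeros_like(Q)
        s = env.reset()                           # line 4: initialize state
        done = False
        while not done:                           # line 6: while s is not terminal
            p = softmax_policy(Q[s], T)           # line 7: softmax policy derived from Q
            a = rng.choice(n_actions, p=p)
            s_next, r, done = env.step(a)         # line 8: observe reward and next state
            # line 9: reward prediction error; Q-λ bootstraps on max_ã Q(s', ã).
            # For SARSA-λ, replace the max by Q[s_next, a_next] with the action actually taken.
            target = 0.0 if done else Q[s_next].max()   # terminal states treated as value 0
            rpe = r + gamma * target - Q[s, a]
            e[s, a] = 1.0                         # line 10: tag the visited state-action pair
            Q += alpha * rpe * e                  # lines 11-12: update all Q values via traces
            e *= gamma * lam                      # line 13: decay all traces
            s = s_next                            # line 14: advance to the next step
    return Q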