Maximum Entropy Exploration in Contextual Bandits with Neural Networks and Energy Based Models

2023 Jan 18;25(2):188. doi: 10.3390/e25020188

Algorithm 2 Contextual bandit with Energy Based Models

Input: $N, θ_{0}, X_{0}, K, c, α, a_{\max}, a_{\min}, η, σ$
for $i = 1, \dots, N$ do
Choose $a_{i} \sim π_{i}$ with SGLD, initialised at $\tilde{a}^{0} \sim U(a_{\min}, a_{\max})$
for $k = 1, \dots, K$ do
Draw noise sample $ω \sim \mathcal{N}(0, σ)$
$\tilde{a}^{k} \leftarrow \tilde{a}^{k-1} - η \nabla_{a} E_{θ_{i-1}}(\tilde{a}^{k-1}, s_{i}) / α + ω$
end for
Play action $\tilde{a}^{K}$, receive $r_{i}$, update $X$
Every $c$ steps, train $E_{θ}$ in batches:
$θ_{i} = \arg\min_{θ} \sum_{X} \log\left(1 + e^{E_{θ}(a^{+}, s_{j}) - E_{θ}(a^{-}, s_{j})}\right)$
end for
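
The inner SGLD loop and the pairwise training objective of Algorithm 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the quadratic `energy` function and its analytic gradient are hypothetical stand-ins for the neural network $E_{θ}$, all function names are chosen here for illustration, and $σ$ is assumed to be the standard deviation of the noise (the algorithm does not say whether it denotes a standard deviation or a variance).

```python
import math
import random


def energy(a, s, theta):
    """Toy quadratic energy; a hypothetical stand-in for the network E_theta."""
    return theta * (a - s) ** 2


def energy_grad_a(a, s, theta):
    """Analytic gradient of the toy energy with respect to the action a."""
    return 2.0 * theta * (a - s)


def sgld_sample(s, theta, K, eta, sigma, alpha, a_min, a_max, rng):
    """Run K noisy gradient steps on the energy, starting from a uniform draw."""
    a = rng.uniform(a_min, a_max)  # initial action ~ U(a_min, a_max)
    for _ in range(K):
        omega = rng.gauss(0.0, sigma)  # assumption: sigma is a standard deviation
        a = a - eta * energy_grad_a(a, s, theta) / alpha + omega
    return a


def pairwise_logistic_loss(e_pos, e_neg):
    """log(1 + exp(E(a+, s) - E(a-, s))), evaluated in a numerically stable form."""
    d = e_pos - e_neg
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))


# Usage: sample actions for a context whose toy energy minimum sits at a = 0.5.
rng = random.Random(0)
samples = [
    sgld_sample(s=0.5, theta=1.0, K=100, eta=0.05, sigma=0.01,
                alpha=1.0, a_min=0.0, a_max=1.0, rng=rng)
    for _ in range(200)
]
mean_action = sum(samples) / len(samples)

# The loss is small when the positive action has lower energy than the negative one.
good_pair = pairwise_logistic_loss(energy(0.5, 0.5, 1.0), energy(0.9, 0.5, 1.0))
bad_pair = pairwise_logistic_loss(energy(0.9, 0.5, 1.0), energy(0.5, 0.5, 1.0))
```

With this toy energy the samples concentrate around the minimum at $a = s$, and the loss for a correctly ordered pair is lower than for the reversed pair. In the algorithm itself the gradient would come from differentiating the trained network with respect to the action.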