Maximum Entropy Exploration in Contextual Bandits with Neural Networks and Energy Based Models

2023 Jan 18;25(2):188. doi: 10.3390/e25020188

Algorithm 1 Maximum entropy exploration with neural networks

Input: $α, N, θ_{0}, X_{0}, k$
for $i = 1, \dots, N$ do
Receive context $s_{i}$ and choose $a_{i} \sim π_{i}$ where
$π_{i}(a \mid s_{i}) = \frac{e^{\hat{r}_{θ_{i-1}}(a, s_{i}) / α}}{\int_{a'} e^{\hat{r}_{θ_{i-1}}(a', s_{i}) / α} \, da'}$
Agent receives reward $r_{i}$
Add the triplet $\{s_{i}, a_{i}, r_{i}\}$ to the dataset $X$
Every $k$ steps, train the model $\hat{r}_{θ}$:
$θ_{i} = \operatorname*{arg\,min}_{θ} \sum_{\{s_{j}, a_{j}, r_{j}\} \in X} \left| r_{j} - \hat{r}_{θ}(a_{j}, s_{j}) \right|$
end for
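
The loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under several assumptions that are not part of the original algorithm: the action set is discrete, so the integral in the denominator of $π_{i}$ becomes a sum; a linear model trained by subgradient descent stands in for the neural network $\hat{r}_{θ}$; and the toy environment, the feature map `features`, and all function names and default values are hypothetical.

```python
import numpy as np

def maxent_policy(r_hat, alpha):
    """Softmax over estimated rewards with temperature alpha (discrete actions)."""
    z = (r_hat - np.max(r_hat)) / alpha   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def features(a, s, n_actions):
    """Hypothetical feature map: the context placed in the slot of action a."""
    phi = np.zeros(n_actions * s.size)
    phi[a * s.size:(a + 1) * s.size] = s
    return phi

def fit_l1(Phi, r, theta, lr=0.05, epochs=200):
    """Subgradient descent on the mean absolute error |r - Phi @ theta|."""
    for _ in range(epochs):
        resid = r - Phi @ theta
        theta = theta + lr * Phi.T @ np.sign(resid) / len(r)
    return theta

def run(alpha=0.1, N=500, k=25, n_actions=3, dim=2, seed=0):
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=(n_actions, dim))    # hidden toy environment (assumption)
    theta = np.zeros(n_actions * dim)             # theta_0
    X_phi, X_r = [], []                           # dataset X, starting empty
    for i in range(1, N + 1):
        s = rng.normal(size=dim)                  # receive context s_i
        r_hat = np.array([features(a, s, n_actions) @ theta
                          for a in range(n_actions)])
        pi = maxent_policy(r_hat, alpha)          # pi_i(a | s_i)
        a = rng.choice(n_actions, p=pi)           # a_i ~ pi_i
        r = true_w[a] @ s + 0.1 * rng.normal()    # agent receives reward r_i
        X_phi.append(features(a, s, n_actions))   # add (s_i, a_i, r_i) to X
        X_r.append(r)
        if i % k == 0:                            # every k steps, retrain
            theta = fit_l1(np.array(X_phi), np.array(X_r), theta)
    return theta, true_w
```

In this sketch, a small `alpha` concentrates the policy on the action with the highest estimated reward, while a large `alpha` pushes it towards uniform sampling. The retraining step warm-starts from the previous parameters, which is a design choice of the sketch rather than something the algorithm prescribes.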