Entropy (Basel). 2023 Jan 18;25(2):188. doi: 10.3390/e25020188

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Algorithm 2 Contextual bandit with Energy Based Models
Input: $N, \theta_0, X_0, K, c, \alpha, a_{\max}, a_{\min}, \eta, \sigma$
for $i = 1, \dots, N$ do
    Choose $a_i \sim \pi_i$ with SGLD, $\tilde{a}_0 \sim U(a_{\min}, a_{\max})$
    for $k = 1, \dots, K$ do
        Draw sample for noise $\omega \sim N(0, \sigma)$
        $\tilde{a}_k \leftarrow \tilde{a}_{k-1} - \eta \nabla_x E_{\theta_{i-1}}(\tilde{a}_{k-1}, s_i)/\alpha + \omega$
    end for
    Play action $\tilde{a}_K$, receive $r_i$, update $X$
    Every $c$ steps train $E_\theta$ in batches: $\theta_i = \arg\min_\theta \sum_X \log\left(1 + e^{E_\theta(a^+, s_j) - E_\theta(a^-, s_j)}\right)$
end for
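The two computational pieces of Algorithm 2, the inner SGLD loop that produces the action and the pairwise logistic loss used to train the energy, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The quadratic `energy` function, its parameters `theta = (w, b)`, and the helper names are hypothetical stand-ins for the learned model $E_\theta$. The sketch also assumes that $\sigma$ is a standard deviation and that each training item is a triple of a preferred action $a^+$, a non-preferred action $a^-$, and a context $s_j$.

```python
import math
import random

def energy(theta, a, s):
    """Toy energy (an assumption): quadratic bowl whose minimum depends on s."""
    w, b = theta
    return (a - (w * s + b)) ** 2

def energy_grad_a(theta, a, s):
    """Analytic gradient of the toy energy with respect to the action a."""
    w, b = theta
    return 2.0 * (a - (w * s + b))

def sgld_sample(theta, s, K, eta, alpha, sigma, a_min, a_max, rng):
    """Inner loop of Algorithm 2: K noisy gradient steps on the energy."""
    a = rng.uniform(a_min, a_max)      # initial action drawn from U(a_min, a_max)
    for _ in range(K):
        omega = rng.gauss(0.0, sigma)  # sigma treated as a std. dev. (assumption)
        a = a - eta * energy_grad_a(theta, a, s) / alpha + omega
    return a                           # final iterate, the action to play

def softplus(x):
    """Numerically stable log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pairwise_loss(theta, batch):
    """Sum over (a_pos, a_neg, s) of log(1 + exp(E(a_pos, s) - E(a_neg, s)))."""
    return sum(softplus(energy(theta, a_pos, s) - energy(theta, a_neg, s))
               for a_pos, a_neg, s in batch)

rng = random.Random(0)
theta = (2.0, 0.5)
action = sgld_sample(theta, s=1.0, K=50, eta=0.1, alpha=1.0, sigma=0.05,
                     a_min=-5.0, a_max=5.0, rng=rng)
print(action)  # a noisy sample; the toy energy has its minimum at 2.0 * 1.0 + 0.5
```

In a full loop, `sgld_sample` would be called once per round with the current parameters, and every $c$ rounds the parameters would be refit by minimizing `pairwise_loss` over batches from $X$ with any gradient-based optimizer. That optimizer is left out here because the source does not specify it.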