Sensors (Basel). 2024 Mar 14;24(6):1870. doi: 10.3390/s24061870

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Algorithm 1 REINFORCE
1: Input: learning rate $\alpha$, discount factor $\gamma$
2: Initialize environment $E$
3: Initialize policy parameters $\theta$
4: for episode in $1 \dots N$ do
5:     Use $\pi(s \mid \theta)$ to collect $|E|$ trajectories: $S_0, A_0, R_0, \dots, R_T$
6:     $G = 0$
7:     for $t = T-1 \dots 0$ do
8:         $G = R_t + \gamma G$
9:         Compute entropy regularization $ER_t = -\sum_{\alpha} \pi(A_t \mid S_t) \log \pi(A_t \mid S_t)$
10:         $\widehat{J(\theta_t)} = \gamma^{t} G \log \pi(A_t \mid S_t, \theta_t) - ER_t$
11:         $\theta_{t+1} = \theta_t + \alpha \nabla \widehat{J(\theta_t)}$
12:     end for
13: end for
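The backward loop of the listing (steps 6 to 11) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes a tabular softmax policy, stored as a hypothetical dict `theta` that maps each state to a list of action logits, and it assumes the entropy term is summed over all actions of the current state. The function names `softmax`, `entropy`, and `reinforce_episode_update` are illustrative. The sketch keeps the sign of the entropy term as printed in the listing; some implementations add an entropy bonus instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy -sum_a p(a) log p(a) of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reinforce_episode_update(theta, states, actions, rewards, alpha, gamma):
    """Run the backward loop over one recorded trajectory, updating theta in place.

    theta is a hypothetical tabular policy: dict of state -> list of logits.
    Returns the per-step objective values, ordered from t = T-1 down to 0.
    """
    G = 0.0
    objectives = []
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G              # step 8: G = R_t + gamma * G
        s, a = states[t], actions[t]
        probs = softmax(theta[s])
        H = entropy(probs)                      # step 9 (assumed: sum over all actions)
        weight = (gamma ** t) * G
        objectives.append(weight * math.log(probs[a]) - H)   # step 10
        # Analytic gradient of the step-10 objective w.r.t. the logits of state s.
        grad = []
        for k, p in enumerate(probs):
            score = (1.0 if k == a else 0.0) - p    # d log pi(a|s) / d logit_k
            d_entropy = -p * (math.log(p) + H)      # d H / d logit_k
            grad.append(weight * score - d_entropy)
        # step 11: gradient ascent on the logits of the visited state
        theta[s] = [x + alpha * g for x, g in zip(theta[s], grad)]
    return objectives


# Usage: one state, two actions, a two-step trajectory.
theta = {"s0": [0.0, 0.0]}
objs = reinforce_episode_update(
    theta, ["s0", "s0"], [0, 1], [1.0, 0.0], alpha=0.1, gamma=0.9
)
```

In this toy run the only rewarded action is action 0 at t = 0, so the update raises its logit and lowers the other one by the same amount. With a uniform policy the entropy gradient is zero, which is why only the return-weighted score term moves the parameters here.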