Sensors (Basel). 2022 Nov 6;22(21):8543. doi: 10.3390/s22218543

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Algorithm 1 Q-learning algorithm.
1: repeat
2:   for each data item in each mini-batch sample
3:     using a greedy strategy, choose action $u_t$, receive reward $r_{t+1}$, and reach a new state $x_{t+1}$
4:     $Q(x_t,u_t) \leftarrow Q(x_t,u_t) + \alpha\left[r_{t+1} + \gamma \max_{u} Q(x_{t+1},u) - Q(x_t,u_t)\right]$
5:     $x_t \leftarrow x_{t+1}$
6: until all $Q(x,u)$ have converged
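The update rule in Algorithm 1 can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the five-state chain environment, the `step` helper, the epsilon-greedy exploration (used here in place of a purely greedy choice so that every action keeps being tried), the hyperparameter values, and the per-episode convergence check are all assumptions made for illustration.

```python
import random


def q_learning(n_states=5, n_actions=2, alpha=0.1, gamma=0.9,
               epsilon=0.1, tol=1e-6, max_episodes=5000, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain (illustrative only)."""
    rng = random.Random(seed)
    # Q[x][u]: estimated return for taking action u in state x.
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def step(x, u):
        # Assumed toy dynamics: action 1 moves right, action 0 moves left.
        # Reaching the last state yields reward 1 and ends the episode.
        x_next = min(x + 1, n_states - 1) if u == 1 else max(x - 1, 0)
        done = x_next == n_states - 1
        return x_next, (1.0 if done else 0.0), done

    for _ in range(max_episodes):
        x, done, max_delta = 0, False, 0.0
        while not done:
            # Epsilon-greedy action choice with random tie-breaking (assumption).
            if rng.random() < epsilon:
                u = rng.randrange(n_actions)
            else:
                best = max(Q[x])
                u = rng.choice([a for a in range(n_actions) if Q[x][a] == best])
            x_next, r, done = step(x, u)
            # Step 4 of Algorithm 1: move Q(x,u) toward r + gamma * max_u' Q(x',u').
            # No bootstrap term is added from the terminal state.
            target = r + (0.0 if done else gamma * max(Q[x_next]))
            delta = alpha * (target - Q[x][u])
            Q[x][u] += delta
            max_delta = max(max_delta, abs(delta))
            # Step 5 of Algorithm 1: advance to the new state.
            x = x_next
        # Step 6, approximated: stop once an episode changes no entry by more than tol.
        if max_delta < tol:
            break
    return Q


if __name__ == "__main__":
    for state, row in enumerate(q_learning()):
        print(state, [round(q, 3) for q in row])
```

In this toy setting the learned values for the "move right" action decay by a factor of `gamma` per step away from the goal, which is the behavior the update rule is expected to produce. The stopping test is a rough stand-in for "all Q(x,u) have converged", since it only inspects entries visited in the latest episode.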