
Algorithm 2.

Incremental Model Learning for Stochastic Environment (ModelLearning)

(1) Calculate the variation of the continuous state and the reward using equation (3)
(2) A normalization procedure is employed to calculate the normalized variation vector, $v = (\Delta x_N^a, r_N^a)$
(3) $s \leftarrow \phi(x_t)$
(4) Retrieve all clusters, $C = \{c_1, c_2, \ldots, c_l\}$, from the cell of the model, $M(s, a)$, using the state-action pair, $(s, a)$; $|C|$ is the total number of clusters in this cell.
(5) if |C| == 0 then
(6) The first variation vector is set as the center of the first cluster, i.e., $c_1 = v$, $c_1 \in C$
(7) else if |C| > 0 then
(8) Find the nearest cluster, $c_{\underline{l}} \leftarrow \arg\min_{c_l \in C} D(c_l, v)$, and its distance to $v$, $d_{c_{\underline{l}}} = D(c_{\underline{l}}, v)$
(9) if $d_{c_{\underline{l}}} > D_{th}^{a}$ then
(10) Create a new cluster, $c_{l+1} = v$, $c_{l+1} \in C$
(11) else
(12) Activate the cluster, $c_{\underline{l}}$, and retrieve the information stored in this cluster: the mean variation $\overline{\Delta x}^{\,a}_{N_{c_{\underline{l}}}}$ and its variance, and the mean reward $\bar{r}^{\,a}_{N_{c_{\underline{l}}}}$ and its variance
(13) Calculate the new mean variation $\overline{\Delta x}^{\,a}_{N_{c_{\underline{l}}}+1}$ and its variance, and the new mean reward $\bar{r}^{\,a}_{N_{c_{\underline{l}}}+1}$ and its variance, using equations (5)-(8).
(14) $N_{c_{\underline{l}}} \leftarrow N_{c_{\underline{l}}} + 1$
(15) end if
(16) end if
(17) Update the variation transition function, $T_p(\overline{\Delta x}^{\,a}_{N_{c_{\underline{l}}}+1} \mid s, a)$, and the reward function, $R_p(s', s, a)$:
$T_p(\overline{\Delta x}^{\,a}_{N_{c_{\underline{l}}}+1} \mid s, a) \leftarrow \overline{\Delta x}^{\,a}_{N_{c_{\underline{l}}}+1}, \qquad R_p(s', s, a) \leftarrow \bar{r}^{\,a}_{N_{c_{\underline{l}}}+1}$
(18) Store these two functions in the cell of the model, $M(s, a)$:
$M(s, a) \leftarrow \big(T_p(\overline{\Delta x}^{\,a}_{N_{c_{\underline{l}}}+1} \mid s, a),\; R_p(s', s, a)\big)$
(19) return M(s, a)
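The Python sketch below illustrates one way the flow of Algorithm 2 could be realized; it is not the authors' implementation. Equations (3) and (5)-(8), the normalization procedure, the state mapping $\phi$, the distance function $D$, and the threshold $D_{th}^{a}$ are not reproduced in this excerpt, so the sketch substitutes a unit-norm normalization, a Euclidean distance, a fixed scalar threshold, and standard Welford-style incremental mean/variance updates. The names IncrementalModel, ClusterStats, phi, and d_th are illustrative, not from the paper.

```python
import numpy as np


class ClusterStats:
    """Running statistics of one cluster of variation vectors v = (Δx, r).

    The incremental mean/variance recursions below are standard Welford-style
    updates used as stand-ins for equations (5)-(8), which this excerpt omits.
    """

    def __init__(self, v):
        self.center = np.asarray(v, dtype=float)  # cluster center (first vector);
                                                  # the excerpt does not say whether
                                                  # the center tracks the running mean,
                                                  # so here it stays fixed
        self.n = 1                                # N_{c_l}: vectors absorbed so far
        self.mean = self.center.copy()            # running mean of (Δx, r)
        self.m2 = np.zeros_like(self.center)      # running sum of squared deviations

    def update(self, v):
        """Absorb a new variation vector and refresh mean/variance (steps 12-14)."""
        v = np.asarray(v, dtype=float)
        self.n += 1
        delta = v - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (v - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n


class IncrementalModel:
    """Sketch of Algorithm 2: cluster-based incremental model learning.

    Assumed here (not specified in the excerpt): Euclidean distance D, a single
    scalar threshold d_th in place of D_th^a, unit-norm normalization, and a
    user-supplied discretizer phi mapping a continuous state to a hashable key.
    """

    def __init__(self, phi, d_th=0.5):
        self.phi = phi      # maps continuous state x to discrete state s
        self.d_th = d_th    # distance threshold for spawning a new cluster
        self.model = {}     # M(s, a) -> list of ClusterStats

    def learn(self, x_t, a, x_next, r):
        # Step 1: variation of the continuous state and the reward (stand-in for eq. (3)).
        v = np.append(np.asarray(x_next, dtype=float) - np.asarray(x_t, dtype=float), r)
        # Step 2: normalization (placeholder; the paper's procedure is not given here).
        v = v / (np.linalg.norm(v) + 1e-8)
        # Step 3: discretize the continuous state.
        s = self.phi(x_t)
        # Step 4: retrieve the clusters stored in the cell M(s, a).
        clusters = self.model.setdefault((s, a), [])

        if not clusters:
            # Steps 5-6: the first vector becomes the center of the first cluster.
            clusters.append(ClusterStats(v))
        else:
            # Steps 8-14: activate the nearest cluster or create a new one.
            dists = [np.linalg.norm(c.center - v) for c in clusters]
            nearest = int(np.argmin(dists))
            if dists[nearest] > self.d_th:
                clusters.append(ClusterStats(v))   # step 10: new cluster
            else:
                clusters[nearest].update(v)        # steps 12-14: incremental update

        # Steps 17-18: the cell now holds the per-cluster transition and reward
        # statistics (mean variation and mean reward), playing the role of T_p and R_p.
        return self.model[(s, a)]                  # step 19: return M(s, a)
```

As an illustrative usage, phi could be a coarse grid discretizer such as `lambda x: tuple(np.floor(10 * np.asarray(x)).astype(int))`, after which repeated calls to `learn(x_t, a, x_next, r)` populate each $M(s, a)$ cell one transition at a time. Keeping several clusters per cell, rather than a single running average, presumably lets the model represent multi-modal outcomes of a stochastic environment without storing every observed sample.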