Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories

Author manuscript; available in PMC: 2013 Sep 16.

Published in final edited form as: Ann Oper Res. 2013 Sep 1;208(1):383–416. doi: 10.1007/s10479-012-1248-5

Algorithm 2 CGRL algorithm.

Input:

F_{n} = {(x^{l}, u^{l}, r^{l}, y^{l})}_{l = 1}^{n}, L_{f}, L_{ρ}, x_{0}, T

Initialization:

D ← n × (T − 1) matrix initialized to zero;

A ← n–dimensional vector initialized to zero;

B ← n–dimensional vector initialized to zero;

Computation of the Lipschitz constants {L'_{Q_N}}_{N = 1}^{T}:

L'_{Q_1} ← L_ρ;

for k = 2 … T do

L'_{Q_k} ← L_ρ + L_f L'_{Q_{k − 1}};

end for

t ← T − 2;

while t > −1 do

for i = 1 … n do

j₀ ← arg max_{j ∈ {1, …, n}} [ r^j − L'_{Q_{T − t − 1}} ‖y^i − x^j‖_X + B(j) ];

m₀ ← max_{j ∈ {1, …, n}} [ r^j − L'_{Q_{T − t − 1}} ‖y^i − x^j‖_X + B(j) ];

A(i) ← m₀;

D(i, t + 1) ← j₀; // best tuple at time t + 1 if in tuple i at time t

end for

B ← A;

t ← t − 1;

end while

Conclusion:

S ← (T + 1)–length vector initialized to zero; // entries 1 … T hold the actions, entry T + 1 holds the lower bound

l ← arg max_{j ∈ {1, …, n}} [ r^j − L'_{Q_T} ‖x₀ − x^j‖_X + B(j) ];

S(T + 1) ← max_{j ∈ {1, …, n}} [ r^j − L'_{Q_T} ‖x₀ − x^j‖_X + B(j) ]; // best lower bound

S(1) ← u^l; // CGRL action for t = 0

for t = 0 … T − 2 do

l′ ← D(l, t + 1);

S(t + 2) ← u^{l′}; // other CGRL actions

l ← l′;

end for

Return: S
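The listing above can be sketched in Python as follows. This is an illustrative transcription under stated assumptions, not the authors' implementation. The function name `cgrl`, the use of NumPy, and the choice of the Euclidean norm for ‖·‖_X are assumptions made here. Unlike the pseudocode, the sketch returns the action sequence and the lower bound as two separate values rather than packing them into one vector S, and it uses 0-based indices.

```python
import numpy as np

def cgrl(F, L_f, L_rho, x0, T):
    """Sketch of Algorithm 2. F is a list of one-step tuples (x, u, r, y)."""
    X = np.array([np.atleast_1d(x) for x, _, _, _ in F], dtype=float)
    U = [u for _, u, _, _ in F]
    R = np.array([r for _, _, r, _ in F], dtype=float)
    Y = np.array([np.atleast_1d(y) for _, _, _, y in F], dtype=float)
    n = len(F)

    # L_Q[k] stores L'_{Q_k} for k = 1..T (index 0 is unused).
    L_Q = np.zeros(T + 1)
    L_Q[1] = L_rho
    for k in range(2, T + 1):
        L_Q[k] = L_rho + L_f * L_Q[k - 1]

    # dist[i, j] = ||y^i - x^j|| (Euclidean norm: an assumption).
    dist = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)

    B = np.zeros(n)
    # Column t of D holds D(., t + 1) from the pseudocode.
    D = np.zeros((n, max(T - 1, 0)), dtype=int)
    for t in range(T - 2, -1, -1):
        # scores[i, j] = r^j - L'_{Q_{T-t-1}} ||y^i - x^j|| + B(j)
        scores = R[None, :] - L_Q[T - t - 1] * dist + B[None, :]
        D[:, t] = scores.argmax(axis=1)
        B = scores.max(axis=1)

    # Conclusion: choose the first tuple from the initial state x0.
    d0 = np.linalg.norm(X - np.atleast_1d(x0), axis=1)
    first = R - L_Q[T] * d0 + B
    l = int(first.argmax())
    bound = float(first.max())  # best lower bound

    # Follow the stored pointers to recover the remaining actions.
    actions = [U[l]]
    for t in range(T - 1):
        l = int(D[l, t])
        actions.append(U[l])
    return actions, bound
```

For example, with the hypothetical sample `F = [(0.0, 'a', 1.0, 1.0), (1.0, 'b', 2.0, 2.0), (0.0, 'c', 0.0, 5.0)]`, `L_f = L_rho = 1`, `x0 = 0` and `T = 2`, the call `cgrl(F, 1.0, 1.0, 0.0, 2)` selects the tuple sequence (1, 2), giving the actions `['a', 'b']` and the lower bound 3. Precomputing the n × n distance matrix keeps each backward step to a single vectorized maximization, at the cost of O(n²) memory.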