The Convergence of a Cooperation Markov Decision Process System

Entropy. 2020 Aug 30;22(9):955. doi: 10.3390/e22090955

Algorithm 1 The Optimal Strategy Pairs

Input: $CMDP$ quintuple $\langle S, A, \{Agent_{0}, Agent_{1}\}, \{T_{0}, T_{1}\}, \{R_{0}, R_{1}\} \rangle$; error parameter $ε > 0$; balance parameter $0 < α < 1$; discount factor $0 < γ < 1$.
Output: Optimal strategy pair $(π_{*}^{0}, π_{*}^{1}) = (π_{k}^{0}, π_{k}^{1})$; optimal value function $V_{k + 1} = V_{k + 1}^{(π_{*}^{0}, π_{*}^{1})}$.

1. Initialize the V-functions $V_{1}^{0}, V_{1}^{1}$ with randomly chosen initial values, for example $V_{1}^{1} (s) = 0$ $(s \in S)$; set $V_{1} = α V_{1}^{0} + (1 - α) V_{1}^{1}$;
2. Use Equation (19) to greedily improve the strategy pair $(π_{1}^{0}, π_{1}^{1})$, with $V_{1}^{(π_{1}^{0}, π_{1}^{1})} = V_{1}$;
3. Use the strategy pair $(π_{1}^{0}, π_{1}^{1})$ updated in step 2 and Formula (18) to find the V-function $V_{2}$;
4. Repeat steps 2 and 3;
5. Step $k + 1$: assuming $(π_{k}^{0}, π_{k}^{1})$ and $V_{k}^{(π_{k}^{0}, π_{k}^{1})}$ have been obtained, perform the following two calculations:
   - (a) Evolution of the $CMDP$ system: compute $V_{k + 1}$ from $(π_{k}^{0}, π_{k}^{1})$ using Formula (18); when $\| V_{k + 1} - V_{k} \| < ε$, define $V_{k + 1}^{(π_{k}^{0}, π_{k}^{1})} (s, t) = V_{k} (s, t)$ for all $(s, t) \in S \times S$.
   - (b) Greedy computation of $(π_{k + 1}^{0}, π_{k + 1}^{1})$:

     $\begin{matrix} π_{k + 1}^{0} (s, t) = \arg\max_{a} \sum_{(s', a') \in S \times A} P_{0} (s, t, a, s', a') [R_{0} (s, t, a, s', a') + γ \cdot V_{0}^{(π_{k}^{0}, π_{k}^{1})} (s', t)], \\ π_{k + 1}^{1} (s, t) = \arg\max_{a} \sum_{(t', a') \in S \times A} P_{1} (s, t, a, t', a') [R_{1} (s, t, a, t', a') + γ \cdot V_{0}^{(π_{k}^{0}, π_{k}^{1})} (s, t')]. \end{matrix}$
6. If $π_{k + 1}^{0} = π_{k}^{0}$ and $π_{k + 1}^{1} = π_{k}^{1}$, terminate the calculation.
7. Return the result.
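
The loop structure of Algorithm 1 (evaluate a fixed strategy pair until the change in $V$ falls below $ε$, improve both strategies greedily, stop when neither strategy changes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Formula (18) and Equation (19) are not reproduced in this section, so the backup used here is an assumed stand-in: an $α$-weighted mix of the two agents' one-step returns on a single combined value table. The sketch also drops the next-action argument $a'$ that appears in the greedy formula. All function names, array layouts (`P[s][t][a][next]`), and the toy model are hypothetical.

```python
from itertools import product


def q0(P0, R0, V, gamma, s, t, a):
    """Agent 0's one-step return: agent 0 moves s -> s2 while t stays fixed."""
    return sum(P0[s][t][a][s2] * (R0[s][t][a][s2] + gamma * V[s2][t])
               for s2 in range(len(V)))


def q1(P1, R1, V, gamma, s, t, a):
    """Agent 1's one-step return: agent 1 moves t -> t2 while s stays fixed."""
    return sum(P1[s][t][a][t2] * (R1[s][t][a][t2] + gamma * V[s][t2])
               for t2 in range(len(V)))


def evaluate(P0, R0, P1, R1, pi0, pi1, V, alpha, gamma, eps):
    """Step (a): repeat the assumed backup until max |V_new - V| < eps."""
    n = len(V)
    while True:
        new_V = [[alpha * q0(P0, R0, V, gamma, s, t, pi0[s][t])
                  + (1 - alpha) * q1(P1, R1, V, gamma, s, t, pi1[s][t])
                  for t in range(n)] for s in range(n)]
        diff = max(abs(new_V[s][t] - V[s][t])
                   for s, t in product(range(n), repeat=2))
        V = new_V
        if diff < eps:
            return V


def greedy(P0, R0, P1, R1, V, gamma, n_actions):
    """Step (b): each agent picks the action maximizing its own one-step return."""
    n = len(V)
    acts = range(n_actions)
    pi0 = [[max(acts, key=lambda a: q0(P0, R0, V, gamma, s, t, a))
            for t in range(n)] for s in range(n)]
    pi1 = [[max(acts, key=lambda a: q1(P1, R1, V, gamma, s, t, a))
            for t in range(n)] for s in range(n)]
    return pi0, pi1


def optimal_strategy_pair(P0, R0, P1, R1, alpha=0.5, gamma=0.9,
                          eps=1e-8, max_rounds=1000):
    n, n_actions = len(P0), len(P0[0][0])
    V = [[0.0] * n for _ in range(n)]          # V_1 = 0 on S x S
    pi0, pi1 = greedy(P0, R0, P1, R1, V, gamma, n_actions)
    for _ in range(max_rounds):                # safety cap, not part of Algorithm 1
        V = evaluate(P0, R0, P1, R1, pi0, pi1, V, alpha, gamma, eps)
        new0, new1 = greedy(P0, R0, P1, R1, V, gamma, n_actions)
        if new0 == pi0 and new1 == pi1:        # neither strategy changed: stop
            break
        pi0, pi1 = new0, new1
    return pi0, pi1, V


# Hypothetical toy model: 2 states, 2 actions; action a moves the acting agent
# to state a with probability 1, regardless of the current joint state.
n = 2
P = [[[[1.0 if nxt == a else 0.0 for nxt in range(n)] for a in range(2)]
      for _ in range(n)] for _ in range(n)]
# Agent 0 is rewarded for landing in state 1, agent 1 for landing in state 0.
R0 = [[[[1.0 if nxt == 1 else 0.0 for nxt in range(n)] for _ in range(2)]
       for _ in range(n)] for _ in range(n)]
R1 = [[[[1.0 if nxt == 0 else 0.0 for nxt in range(n)] for _ in range(2)]
       for _ in range(n)] for _ in range(n)]

pi0, pi1, V = optimal_strategy_pair(P, R0, P, R1)
```

In this toy model every step yields a combined reward of 1 under the greedy pair, so the combined value table approaches $1 / (1 - γ)$ in every joint state. Because the evaluation step is only accurate to $ε$, the sketch adds an iteration cap as a safeguard against strategies that alternate between near-ties.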