2020 Aug 30;22(9):955. doi: 10.3390/e22090955
Algorithm 1 The Optimal Strategy Pairs
Input: CMDP quintuple $\langle S, A, \{\mathrm{Agent}_0, \mathrm{Agent}_1\}, \{T_0, T_1\}, \{R_0, R_1\} \rangle$; error parameter $\varepsilon > 0$; balance parameter $0 < \alpha < 1$; discount factor $0 < \gamma < 1$.
Output: optimal strategy pair $(\pi_*^0, \pi_*^1) = (\pi_k^0, \pi_k^1)$; optimal value function $V_{k+1} = V_{k+1}(\pi_*^0, \pi_*^1)$.
  1. Initialize the V-functions $V_1^0, V_1^1$ with arbitrary initial values, for example $V_1^1(s) = 0$ $(s \in S)$; set $V_1 = \alpha V_1^0 + (1 - \alpha) V_1^1$;

  2. Use Equation (19) to greedily improve the strategy pair $(\pi_1^0, \pi_1^1)$, with $V_1(\pi_1^0, \pi_1^1) = V_1$;

  3. Use the updated strategy pair $(\pi_1^0, \pi_1^1)$ from step 2 and Formula (18) to find the V-function $V_2$;

  4. Repeat steps 2 and 3;

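  5. Step $k+1$: assuming $(\pi_k^0, \pi_k^1)$ and $V_k(\pi_k^0, \pi_k^1)$ have been obtained, perform the following two calculations:
    • (a) Evolutionary calculation of the CMDP system: find $V_{k+1}$ from $(\pi_k^0, \pi_k^1)$ and Formula (18); when $\lVert V_{k+1} - V_k \rVert < \varepsilon$, define $V_{k+1}(\pi_k^0, \pi_k^1)(s,t) = V_k(s,t)$ for all $(s,t) \in S \times S$.
    • (b) Greedy computation of $(\pi_{k+1}^0, \pi_{k+1}^1)$:
      $$\pi_{k+1}^0(s,t) = \arg\max_{a} \sum_{(s',a') \in S \times A} P_0(s,t,a,s',a') \left[ R_0(s,t,a,s',a') + \gamma \, V^0(\pi_k^0, \pi_k^1)(s',t) \right],$$
      $$\pi_{k+1}^1(s,t) = \arg\max_{a} \sum_{(t',a') \in S \times A} P_1(s,t,a,t',a') \left[ R_1(s,t,a,t',a') + \gamma \, V^0(\pi_k^0, \pi_k^1)(s,t') \right].$$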
  6. If $\pi_{k+1}^0 = \pi_k^0$ and $\pi_{k+1}^1 = \pi_k^1$, terminate the calculation.

  7. Return result.
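
The loop structure of Algorithm 1 (evaluate, then greedily improve, then stop once the strategy pair no longer changes) can be illustrated with a minimal Python sketch. Formulas (18) and (19) are not reproduced in this section, so the sketch rests on stated assumptions rather than on the paper's exact definitions: the evaluation step is taken to be the $\alpha$-weighted sum of each agent's one-step Bellman backup, and the transition tables drop the $a'$ argument for brevity. The function `solve_toy_cmdp`, the table layout `P0[s][a][s2]`, the reward signatures, and the two-state toy problem are all hypothetical names and data introduced for illustration.

```python
from itertools import product


def solve_toy_cmdp(S, A, P0, P1, R0, R1, alpha=0.5, gamma=0.9, eps=1e-8, max_iter=100):
    """Policy iteration on joint states (s, t) for a two-agent toy model.

    Assumed model (not taken from the paper): agent 0 moves s -> s2 with
    probability P0[s][a][s2], agent 1 moves t -> t2 with probability
    P1[t][b][t2]; R0(s, t, s2) and R1(s, t, t2) are the agents' rewards.
    """
    states = list(product(S, S))

    def q0(V, s, t, a):
        # Agent 0's one-step backup: expectation over its next state s2.
        return sum(P0[s][a][s2] * (R0(s, t, s2) + gamma * V[(s2, t)]) for s2 in S)

    def q1(V, s, t, b):
        # Agent 1's one-step backup: expectation over its next state t2.
        return sum(P1[t][b][t2] * (R1(s, t, t2) + gamma * V[(s, t2)]) for t2 in S)

    def greedy(V):
        # Greedy improvement (steps 2 and 5b); ties go to the first action in A.
        pi0 = {(s, t): max(A, key=lambda a: q0(V, s, t, a)) for s, t in states}
        pi1 = {(s, t): max(A, key=lambda b: q1(V, s, t, b)) for s, t in states}
        return pi0, pi1

    def evaluate(pi0, pi1, V):
        # Evaluation (steps 3 and 5a), assumed form: V = alpha*q0 + (1-alpha)*q1,
        # swept until the largest change over all joint states is below eps.
        while True:
            new_V = {
                (s, t): alpha * q0(V, s, t, pi0[(s, t)])
                + (1 - alpha) * q1(V, s, t, pi1[(s, t)])
                for s, t in states
            }
            if max(abs(new_V[x] - V[x]) for x in states) < eps:
                return new_V
            V = new_V

    V = {x: 0.0 for x in states}          # step 1: V_1 = 0 everywhere
    pi0, pi1 = greedy(V)                  # step 2
    for _ in range(max_iter):             # steps 4 and 5, with a safety cap
        V = evaluate(pi0, pi1, V)
        new_pi0, new_pi1 = greedy(V)
        if (new_pi0, new_pi1) == (pi0, pi1):   # step 6: strategy pair is stable
            break
        pi0, pi1 = new_pi0, new_pi1
    return pi0, pi1, V                    # step 7


# Hypothetical toy problem: two states, action 0 stays, action 1 switches state.
S, A = [0, 1], [0, 1]
move = {s: {a: {s2: float(s2 == (s if a == 0 else 1 - s)) for s2 in S} for a in A} for s in S}
R0 = lambda s, t, s2: float(s2 == 1)   # agent 0 is rewarded for reaching state 1
R1 = lambda s, t, t2: float(t2 == 0)   # agent 1 is rewarded for reaching state 0
pi0, pi1, V = solve_toy_cmdp(S, A, move, move, R0, R1)
print(pi0[(0, 0)], pi1[(0, 1)], round(V[(0, 0)], 3))
```

In this toy problem each agent can collect a reward of 1 at every step, so under the stable strategy pair the combined value settles near $1/(1-\gamma) = 10$ in every joint state. The `max_iter` cap is a guard added for the sketch; it is not part of Algorithm 1.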