
Table 1.

The Causal Policy Gradient (CPG) Algorithm

Algorithm 1: The CPG Algorithm
Function CPG
 Input: a differentiable policy parameterization π(a|s,θ), ∀a∈A, s∈S, θ∈R^d; C = 0;
 Initialize policy parameter θ;
 Repeat forever:
  Define event A and event B;
  Generate an episode s_0, a_0, r_1, ..., s_{T−1}, a_{T−1}, r_T, following π(a|s,θ);
  For each step of the episode t=0,...,T-1:
   G ← average future return from step t;
   C ← P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A));
   θ ← θ + α ∇_θ log π(a_t|s_t,θ) · G · C;
 End for
 Return θ;
End CPG
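
The pseudocode leaves the environment, the definitions of events A and B, and the way P(A) and P(A∩B) are estimated unspecified. The following is a minimal Python (NumPy) sketch of the update loop under stated assumptions: a tabular softmax policy over a toy two-state, two-action environment, and a causal term C estimated from a batch of sampled episodes. The functions step, event_a, and event_b, as well as all hyperparameters, are illustrative placeholders and not part of the paper.

import numpy as np

N_STATES, N_ACTIONS = 2, 2
rng = np.random.default_rng(0)

def step(state, action):
    # Hypothetical transition: reward 1 when the action matches the state.
    reward = 1.0 if action == state else 0.0
    return rng.integers(N_STATES), reward

def event_a(episode):
    # Event A (assumption): action 1 was taken at least once in the episode.
    return any(a == 1 for (_, a, _) in episode)

def event_b(episode):
    # Event B (assumption): the episode's total return is positive.
    return sum(r for (_, _, r) in episode) > 0

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def generate_episode(theta, T=10):
    # Roll out s_0, a_0, r_1, ..., s_{T-1}, a_{T-1}, r_T following pi(a|s,theta).
    episode, state = [], rng.integers(N_STATES)
    for _ in range(T):
        action = rng.choice(N_ACTIONS, p=softmax(theta[state]))
        next_state, reward = step(state, action)
        episode.append((state, action, reward))
        state = next_state
    return episode

def causal_term(episodes):
    # C = P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A)), estimated from sampled episodes.
    a = np.array([event_a(ep) for ep in episodes], dtype=float)
    b = np.array([event_b(ep) for ep in episodes], dtype=float)
    p_a = a.mean()
    if p_a in (0.0, 1.0):  # C is undefined when A never or always occurs
        return 0.0
    return (a * b).mean() / p_a - ((1 - a) * b).mean() / (1.0 - p_a)

def cpg(alpha=0.05, n_iters=200, batch=20):
    theta = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters
    for _ in range(n_iters):
        episodes = [generate_episode(theta) for _ in range(batch)]
        C = causal_term(episodes)
        for ep in episodes:
            rewards = [r for (_, _, r) in ep]
            for t, (s, a, _) in enumerate(ep):
                G = np.mean(rewards[t:])              # average future return from step t
                grad_log = -softmax(theta[s])         # d/dtheta[s] of log pi(a|s,theta)
                grad_log[a] += 1.0
                theta[s] += alpha * grad_log * G * C  # theta <- theta + alpha * grad * G * C
    return theta

if __name__ == "__main__":
    print(cpg())

Estimating C once per batch of episodes, rather than per step, is a design choice made here for simplicity; the pseudocode itself places the computation of C inside the per-step loop.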