Table 1.
Algorithm 1: The CPG Algorithm |
---|
Function CPG |
Input: a differentiable policy parameterization π(a|s,θ), ∀a∈A, s∈S, θ∈R^d; C=0; |
Initialize policy parameter θ; |
Repeat forever: |
Define event A and event B; |
Generate an episode s_0, a_0, r_1, ..., s_{T−1}, a_{T−1}, r_T, following π(a|s,θ); |
For each step of the episode t=0,...,T-1: |
G ← average future return from step t; |
C ← P(A∩B)/P(A) − P(¬A∩B)/(1−P(A)); |
θ ← θ + α ∇_θ log π(a_t|s_t,θ) · G · C; |
End for |
Return θ; |
End CPG |
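
The inner loop of Algorithm 1 can be sketched in plain Python as follows. This is only an illustrative sketch under several assumptions that the listing does not state. The policy is taken to be a tabular softmax over per-state action preferences. The probabilities of events A and B are estimated from hypothetical counts. The second denominator of C is read as 1 − P(A), that is, P(¬A). "Average future return" is read as the mean of the rewards remaining from step t. The names `coefficient_c`, `cpg_update`, and their parameters are made up for this example.

```python
import math


def coefficient_c(n_ab, n_a, n_not_a_b, n_total):
    """Estimate C = P(A∩B)/P(A) − P(¬A∩B)/(1 − P(A)) from event counts."""
    p_a = n_a / n_total
    if p_a <= 0.0 or p_a >= 1.0:
        # Assumption: when a denominator vanishes, fall back to the initial C = 0.
        return 0.0
    p_ab = n_ab / n_total
    p_not_a_b = n_not_a_b / n_total
    return p_ab / p_a - p_not_a_b / (1.0 - p_a)


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(prefs, action):
    """Gradient of log π(action) with respect to the preferences of one state."""
    probs = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def cpg_update(theta, episode, c, alpha=0.1):
    """Apply θ ← θ + α ∇θ log π(a_t|s_t,θ) · G · C for every step of one episode.

    theta:   dict mapping state -> list of action preferences (updated in place)
    episode: list of (state, action, reward) tuples, reward being r_{t+1}
    """
    rewards = [r for _, _, r in episode]
    for t, (s, a, _) in enumerate(episode):
        future = rewards[t:]
        g = sum(future) / len(future)  # assumed reading of "average future return"
        grad = grad_log_softmax(theta[s], a)
        theta[s] = [w + alpha * gi * g * c for w, gi in zip(theta[s], grad)]
    return theta
```

With C = 0, the initial value in the listing, the update leaves θ unchanged. A positive C moves the preferences in the usual policy-gradient direction, scaled by C, and a negative C reverses that direction.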