Notations used in the paper.
$\propto$    proportional to
$\exp(\cdot)$    $\exp(x) = e^{x}$
$\|x\|_2$    Euclidean norm of $x$
$t$    timestep
Const    arbitrary constant
$A$    set of possible actions
$S$    set of possible states
$a \in A$    action
$s \in S$    state
$s_0 \in S$    first state of a trajectory
$s_f \in S$    final state of a trajectory
$s' \in S$    state following a tuple $(s,a)$
$h$    history of interactions $(s_0, a_0, s_1, \dots)$
$\hat{s}$    predicted states
$g \in G$    goal
$s_g \in S$    state used as a goal
$S_b$    set of states contained in $b$
$\tau \in T$    trajectory
$u(\tau)$    function that extracts parts of the trajectory $\tau$
$R(s, a, s')$    reward function
$d_t^{\pi}(s)$    $t$-step state distribution
$d_{0:T}^{\pi}(S)$    stationary state-visitation distribution of $\pi$ over a horizon $T$, $\frac{1}{T}\sum_{t=1}^{T} d_t^{\pi}(S)$
$f$    representation function
$z$    compressed latent variable, $z = f(s)$
$\rho \in P$    density model
$\phi \in \Phi$    forward model
$\phi_T \in \Phi_T$    true forward model
$q_{\omega}$    parameterized discriminator
$\pi$    policy
$\pi_g$    policy conditioned on a goal $g$
$nn_k(S, s)$    $k$-th closest state to $s$ in $S$
$D_{\mathrm{KL}}(p(x)\,\|\,p'(x))$    Kullback–Leibler divergence, $\mathbb{E}_{x \sim p(\cdot)} \log \frac{p(x)}{p'(x)}$
$H(X)$    $-\int_X p(x) \log p(x)\,dx$
$H(X|S)$    $-\int_S p(s) \int_X p(x|s) \log p(x|s)\,dx\,ds$
$I(X;Y)$    $H(X) - H(X|Y)$
$I(X;Y|S)$    $H(X|S) - H(X|Y,S)$
$IG(h, A, S, S', \Phi)$    Information gain, $I(S'; \Phi \mid h, A, S)$
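The last rows define the information-theoretic quantities (Kullback–Leibler divergence, entropy, conditional entropy, mutual information) on which the rest of the paper relies. As a minimal numerical illustration, not taken from the paper, the sketch below evaluates these definitions for discrete distributions with NumPy: the integrals above become sums over finite supports, and all function names are ours.

```python
import numpy as np

def entropy(p_x):
    """H(X) = -sum_x p(x) log p(x)."""
    p_x = np.asarray(p_x, dtype=float)
    nz = p_x > 0                     # convention: 0 log 0 = 0
    return -np.sum(p_x[nz] * np.log(p_x[nz]))

def kl_divergence(p_x, q_x):
    """D_KL(p || q) = E_{x~p} [ log p(x)/q(x) ]; assumes q(x) > 0 wherever p(x) > 0."""
    p_x, q_x = np.asarray(p_x, dtype=float), np.asarray(q_x, dtype=float)
    nz = p_x > 0
    return np.sum(p_x[nz] * np.log(p_x[nz] / q_x[nz]))

def conditional_entropy(p_sx):
    """H(X|S) = -sum_s p(s) sum_x p(x|s) log p(x|s), computed from a joint p(s, x)."""
    p_sx = np.asarray(p_sx, dtype=float)
    p_s = p_sx.sum(axis=1, keepdims=True)                 # marginal p(s)
    p_x_given_s = np.divide(p_sx, p_s,
                            out=np.zeros_like(p_sx), where=p_s > 0)
    nz = p_x_given_s > 0
    return -np.sum(p_sx[nz] * np.log(p_x_given_s[nz]))

def mutual_information(p_xy):
    """I(X;Y) = H(X) - H(X|Y), computed from a joint p(x, y)."""
    p_xy = np.asarray(p_xy, dtype=float)
    return entropy(p_xy.sum(axis=1)) - conditional_entropy(p_xy.T)

# Example: a joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print("H(X)   =", entropy(p_xy.sum(axis=1)))   # ~0.693 nats
print("I(X;Y) =", mutual_information(p_xy))    # ~0.193 nats
```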