Author manuscript; available in PMC: 2018 Jan 1.
Published in final edited form as: Adv Neural Inf Process Syst. 2017 Dec;30:5973–5981.

Algorithm 1. Action-Centered Thompson Sampling

1: Set B = I, θ̂ = 0, b̂ = 0; choose [πmin, πmax].
2: for t = 1, 2, … do
3:  Observe the current context s̄t and form st,a for each a ∈ {1, …, N}.
4:  Randomly generate θ′ ~ 𝒩(θ̂, v²B⁻¹).
5:  Let āt = argmax_{a ∈ {1, …, N}} st,aᵀ θ′.
6:  Compute the probability πt of taking a nonzero action according to (6).
7:  Play action at = āt with probability πt; else play at = 0.
8:  Observe the reward rt(at) and update the estimates:
     B = B + πt(1 − πt) st,āt st,ātᵀ,
     b̂ = b̂ + st,āt (𝕀(at > 0) − πt) rt(at),
     θ̂ = B⁻¹ b̂.
9: end for
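The loop above can be sketched in NumPy on a toy simulated environment. Everything about the environment here (the feature dimension, the synthetic `theta_star`, the per-step baseline reward, and the noise scale) is a hypothetical stand-in for illustration. Since equation (6) is not reproduced in this excerpt, the sketch assumes one common choice: πt is the posterior probability that the chosen action's differential reward is positive, clipped to [πmin, πmax]; this is an assumption, not the paper's exact formula.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
d, N, T = 3, 4, 500          # hypothetical feature dim, #actions, horizon
v = 0.1                      # posterior scale parameter
pi_min, pi_max = 0.1, 0.9    # clipping range [pi_min, pi_max]
theta_star = rng.normal(size=d)  # hypothetical true differential-reward parameter

# Step 1: initialize B = I, theta_hat = 0, b_hat = 0.
B = np.eye(d)
b_hat = np.zeros(d)
theta_hat = np.zeros(d)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for t in range(T):
    # Step 3: observe context features s_{t,a} for the N nonzero actions.
    S = rng.normal(size=(N, d))
    # Step 4: sample theta' ~ N(theta_hat, v^2 B^{-1}).
    theta_prime = rng.multivariate_normal(theta_hat, v**2 * np.linalg.inv(B))
    # Step 5: best nonzero action under the sampled parameter.
    a_bar = int(np.argmax(S @ theta_prime))
    s = S[a_bar]
    # Step 6 (assumed stand-in for eq. (6)): clipped posterior probability
    # that the differential reward s^T theta is positive.
    mu = float(s @ theta_hat)
    sigma = v * sqrt(float(s @ np.linalg.solve(B, s)))
    pi_t = min(pi_max, max(pi_min, Phi(mu / max(sigma, 1e-12))))
    # Step 7: play a_bar with probability pi_t, else the zero action.
    a_t = a_bar + 1 if rng.random() < pi_t else 0
    # Simulated reward: time-varying baseline plus differential term (hypothetical).
    baseline = rng.normal()
    r_t = baseline + (float(s @ theta_star) if a_t > 0 else 0.0) + rng.normal(scale=0.1)
    # Step 8: action-centered updates.
    B += pi_t * (1.0 - pi_t) * np.outer(s, s)
    b_hat += s * (float(a_t > 0) - pi_t) * r_t
    theta_hat = np.linalg.solve(B, b_hat)

print(np.round(theta_hat, 2))
```

Note how the `(I(a_t > 0) - pi_t)` factor in the b̂ update centers the action indicator around its probability, so the baseline reward (which does not depend on the action) cancels in expectation and θ̂ tracks only the differential reward of nonzero actions.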