Author manuscript; available in PMC: 2018 Jan 1.

Published in final edited form as: Adv Neural Inf Process Syst. 2017 Dec;30:5973–5981.

Algorithm 1.

Action-Centered Thompson Sampling

Set B = I, θ̂ = 0, b̂ = 0, choose [π_min, π_max].

for t = 1, 2,… do

Observe current context s̄_t and form s_{t,a} for each a ∈ {1,…, N}.

Randomly generate θ′ ~ 𝒩(θ̂, v²B⁻¹).

Let

$$\bar{a}_t = \arg\max_{a \in \{1, \dots, N\}} s_{t,a}^{T} \theta'.$$

Compute probability π_t of taking a nonzero action according to (6).

Play action a_t = ā_t with probability π_t, else play a_t = 0.

Observe reward r_t(a_t) and update θ̂:

$$B = B + \pi_t (1 - \pi_t)\, s_{t,\bar{a}_t} s_{t,\bar{a}_t}^{T}, \quad \hat{b} = \hat{b} + s_{t,\bar{a}_t} \left( I(a_t > 0) - \pi_t \right) r_t(a_t), \quad \hat{\theta} = B^{-1} \hat{b}.$$

end for
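
The steps of Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The class name `ActionCenteredTS`, its method names, and the default values of `v`, `pi_min`, and `pi_max` are hypothetical. Equation (6) is not reproduced in this section, so `nonzero_prob` uses a stand-in: the posterior probability that s^T θ > 0 under 𝒩(θ̂, v²B⁻¹), clipped to [π_min, π_max]. Replace it with the actual rule from (6) where that differs.

```python
import math
import numpy as np


class ActionCenteredTS:
    """Hypothetical sketch of the action-centered Thompson sampling loop."""

    def __init__(self, dim, v=1.0, pi_min=0.1, pi_max=0.9, seed=0):
        self.B = np.eye(dim)            # B = I
        self.b_hat = np.zeros(dim)      # b_hat = 0
        self.theta_hat = np.zeros(dim)  # theta_hat = 0
        self.v = v
        self.pi_min, self.pi_max = pi_min, pi_max
        self.rng = np.random.default_rng(seed)

    def nonzero_prob(self, s):
        # ASSUMPTION: stand-in for Eq. (6), which is not shown in this section.
        # Clipped probability that s^T theta > 0 under N(theta_hat, v^2 B^{-1}).
        mean = float(s @ self.theta_hat)
        std = self.v * math.sqrt(float(s @ np.linalg.solve(self.B, s)))
        if std > 0:
            p = 0.5 * (1.0 + math.erf(mean / (std * math.sqrt(2.0))))
        else:
            p = float(mean > 0)
        return min(self.pi_max, max(self.pi_min, p))

    def select(self, S):
        """S has shape (N, dim); row a-1 holds s_{t,a}. Returns (a_t, a_bar, pi_t)."""
        # Draw theta' ~ N(theta_hat, v^2 B^{-1}) via the Cholesky factor B = L L^T:
        # L^{-T} z has covariance B^{-1} when z is standard normal.
        L = np.linalg.cholesky(self.B)
        z = self.rng.standard_normal(self.theta_hat.shape[0])
        theta_prime = self.theta_hat + self.v * np.linalg.solve(L.T, z)
        a_bar = int(np.argmax(S @ theta_prime)) + 1      # nonzero actions are 1..N
        pi_t = self.nonzero_prob(S[a_bar - 1])
        a_t = a_bar if self.rng.random() < pi_t else 0   # else play the zero action
        return a_t, a_bar, pi_t

    def update(self, s, a_t, pi_t, reward):
        """Apply the B, b_hat, theta_hat updates with s = s_{t, a_bar}."""
        self.B += pi_t * (1.0 - pi_t) * np.outer(s, s)
        self.b_hat += s * ((1.0 if a_t > 0 else 0.0) - pi_t) * reward
        self.theta_hat = np.linalg.solve(self.B, self.b_hat)
```

In one round, a caller would build the feature matrix `S` from the current context, call `select(S)`, observe the reward for the returned `a_t`, and then call `update(S[a_bar - 1], a_t, pi_t, reward)`. Solving the linear system instead of forming B⁻¹ explicitly is a numerical choice of this sketch and gives the same θ̂ = B⁻¹b̂.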