Algorithm 1.
1: Set B = I, θ̂ = 0, b̂ = 0, choose [π_min, π_max].
2: for t = 1, 2, … do
3: Observe current context s̄_t and form s_{t,a} for each a ∈ {1, …, N}.
4: Randomly generate θ′ ~ 𝒩(θ̂, v²B⁻¹).
5: Let ā_t = argmax_{a ∈ {1, …, N}} s_{t,a}ᵀθ′.
6: Compute probability π_t of taking a nonzero action according to (6).
7: Play action a_t = ā_t with probability π_t, else play a_t = 0.
8: Observe reward r_t(a_t) and update θ̂:
B = B + π_t(1 − π_t) s_{t,ā_t} s_{t,ā_t}ᵀ,  b̂ = b̂ + s_{t,ā_t}(I(a_t > 0) − π_t) r_t(a_t),  θ̂ = B⁻¹b̂.
9: end for
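
To make the steps of Algorithm 1 concrete, the following is a minimal Python sketch rather than a reference implementation. Several parts are assumptions not stated in the listing itself: the class name `ActionCenteredTS` and its method names are hypothetical; rule (6) is taken to be the probability that s_{t,ā_t}ᵀθ̃ > 0 under θ̃ ~ 𝒩(θ̂, v²B⁻¹), clipped to [π_min, π_max]; and the update in step 8 is taken to be the action-centered form B = B + π_t(1 − π_t) s sᵀ, b̂ = b̂ + s(I(a_t > 0) − π_t) r_t(a_t), θ̂ = B⁻¹b̂, with s = s_{t,ā_t}. The values of v, π_min, and π_max are illustrative defaults.

```python
import math
import numpy as np


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


class ActionCenteredTS:
    """Hypothetical sketch of Algorithm 1 (names and defaults are assumptions)."""

    def __init__(self, dim, v=1.0, pi_min=0.1, pi_max=0.9, seed=0):
        # Step 1: B = I, theta_hat = 0, b_hat = 0, and the interval [pi_min, pi_max].
        self.B = np.eye(dim)
        self.b_hat = np.zeros(dim)
        self.theta_hat = np.zeros(dim)
        self.v = v
        self.pi_min, self.pi_max = pi_min, pi_max
        self.rng = np.random.default_rng(seed)

    def select(self, S):
        """S has shape (N, dim); row a-1 holds s_{t,a}. Returns (a_t, a_bar, pi_t)."""
        cov = self.v ** 2 * np.linalg.inv(self.B)
        # Step 4: draw theta' ~ N(theta_hat, v^2 B^{-1}).
        theta_prime = self.rng.multivariate_normal(self.theta_hat, cov)
        # Step 5: best nonzero action under the sampled parameter (1-based index).
        a_bar = int(np.argmax(S @ theta_prime)) + 1
        s = S[a_bar - 1]
        # Step 6 (ASSUMED form of (6)): P(s^T theta_tilde > 0), clipped.
        mean = float(s @ self.theta_hat)
        std = math.sqrt(max(float(s @ cov @ s), 0.0))
        p_pos = normal_cdf(mean / std) if std > 0 else float(mean > 0)
        pi_t = min(self.pi_max, max(self.pi_min, p_pos))
        # Step 7: play a_bar with probability pi_t, otherwise the zero action.
        a_t = a_bar if self.rng.random() < pi_t else 0
        return a_t, a_bar, pi_t

    def update(self, S, a_t, a_bar, pi_t, reward):
        """Step 8 (ASSUMED action-centered update, see lead-in)."""
        s = S[a_bar - 1]
        self.B += pi_t * (1.0 - pi_t) * np.outer(s, s)
        self.b_hat += s * ((1.0 if a_t > 0 else 0.0) - pi_t) * reward
        self.theta_hat = np.linalg.solve(self.B, self.b_hat)


if __name__ == "__main__":
    # Hypothetical environment: a fixed baseline reward plus a linear
    # treatment effect that applies only to nonzero actions.
    env_rng = np.random.default_rng(123)
    theta_true = np.array([0.5, -0.3, 0.2])
    agent = ActionCenteredTS(dim=3)
    for t in range(200):
        S = env_rng.normal(size=(4, 3))  # N = 4 candidate nonzero actions
        a_t, a_bar, pi_t = agent.select(S)
        effect = float(S[a_t - 1] @ theta_true) if a_t > 0 else 0.0
        reward = 1.0 + effect + env_rng.normal(scale=0.1)
        agent.update(S, a_t, a_bar, pi_t, reward)
    print(agent.theta_hat)
```

Note that the update uses s_{t,ā_t} and π_t even when the zero action is played, so `update` takes `a_bar` and `pi_t` from `select` rather than recomputing them. Solving the linear system with `np.linalg.solve` avoids forming B⁻¹ explicitly for θ̂, although the sketch still inverts B once per round to build the sampling covariance.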