
Figure 3:

a. A two-neuron network that can take two actions. Updates follow Algorithm 1 without the ReLU nonlinearity for ease of analysis; J=1. b. Without noise, networks randomly fixate on one action. c-d. With noise, networks choose the rewarding action at a rate that depends on the activity and weight noise values. Means plotted over 100 seeds; shaded regions denote standard deviation. e. Sample actions chosen over 250k timesteps for ηx=0.042, ηW=0.0013, where η values define the widths of the noise distributions. f. Entropy of rolling 1000-timestep windows for PaN and for an ϵ-greedy algorithm with ϵ set to 0.3 to match PaN's 85% reward rate. g. Same as (c-d) but for ηx=0.042, ηW=0.0013. h. Histogram of entropy values over 100 seeds of 500k timesteps each. In this noise setting, PaN is bimodal, with peaks corresponding to exploration (entropy close to log2(2) = 1) and exploitation (entropy close to 0). The ϵ-greedy agent, in contrast, maintains a consistently random exploration strategy. Appendix B shows that the bimodality is not strongly dependent on window size. i. Mean percentage of time networks selected Action 1, the rewarding action, across different noise scales; 100 seeds of 500k timesteps each. Standard deviations in j. See Appendix C Table 1 for values. An upper bound on compute for this figure and Figure 4 is 500 CPU hours and 55 GB of storage.
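
As a point of reference for panels f and h: with two actions, an ϵ-greedy agent that exploits the rewarding action and explores uniformly at random with probability ϵ = 0.3 earns reward at rate (1 − ϵ) + ϵ/2 = 0.85, matching the 85% used here, and its rolling-window entropy concentrates near −(0.85 log2 0.85 + 0.15 log2 0.15) ≈ 0.61 bits, with no exploitation peak at 0. The following is a minimal sketch of the rolling-window entropy computation and the ϵ-greedy baseline; the function names, the uniform exploration rule, and the moving-average implementation are our assumptions, not code from the paper.

    import numpy as np

    def rolling_entropy(actions, window=1000):
        """Shannon entropy (bits) of binary action choices in rolling windows.

        `actions` is a 1-D array of 0/1 action indices. For a two-action task,
        entropy ranges from 0 (pure exploitation of one action) to
        log2(2) = 1 bit (maximally random exploration).
        """
        actions = np.asarray(actions, dtype=float)
        # Fraction of each window spent on action 1, via a moving average.
        kernel = np.ones(window) / window
        p1 = np.convolve(actions, kernel, mode="valid")
        p0 = 1.0 - p1
        # Treat 0 * log2(0) as 0, the usual convention for entropy.
        with np.errstate(divide="ignore", invalid="ignore"):
            h = -(np.where(p0 > 0, p0 * np.log2(p0), 0.0)
                  + np.where(p1 > 0, p1 * np.log2(p1), 0.0))
        return h

    def epsilon_greedy_actions(n_steps, epsilon=0.3, rewarding_action=1, seed=None):
        """Baseline comparator: take the rewarding action, except explore
        uniformly at random with probability epsilon."""
        rng = np.random.default_rng(seed)
        explore = rng.random(n_steps) < epsilon
        random_choice = rng.integers(0, 2, size=n_steps)
        return np.where(explore, random_choice, rewarding_action)

    # Example: entropy trace for an epsilon-greedy agent over 250k timesteps.
    acts = epsilon_greedy_actions(250_000, epsilon=0.3, seed=0)
    h = rolling_entropy(acts, window=1000)
    print(h.mean())  # ~0.61 bits: consistently random exploration, unimodal

Applied to a PaN action trace instead of the ϵ-greedy one, the same entropy trace would be expected to alternate between values near 1 (exploration bouts) and values near 0 (exploitation bouts), producing the bimodal histogram in panel h.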