
Figure 3:

a. A two-neuron network that can take two actions. Updates follow Algorithm 1 without the ReLU nonlinearity for ease of analysis; J=1. b. Without noise, networks randomly fixate on one action. c-d. With noise, networks choose the rewarding action at a rate that depends on the activity and weight noise values. Means plotted over 100 seeds; shaded regions denote standard deviation. e. Sample actions chosen over 250k timesteps for ηx=0.042, ηW=0.0013, where η values define the widths of the noise distributions. f. Entropy of rolling 1000-timestep windows for PaN and for an ϵ-greedy algorithm with ϵ set to 0.3 to match PaN's 85% reward rate. g. Same as (c-d) but for ηx=0.042, ηW=0.0013. h. Histogram of entropy values over 100 seeds of 500k timesteps each. In this noise setting, PaN is bimodal, with peaks corresponding to exploration (entropy close to log2(2) = 1) and exploitation (entropy close to 0). The ϵ-greedy agent, in contrast, maintains a consistently random exploration strategy. Appendix B shows that the bimodality is not strongly dependent on window size. i. Mean percentage of time networks selected Action 1, the rewarding action, across different noise scales; 100 seeds of 500k timesteps each. Standard deviations in j. See Appendix C Table 1 for values. An upper bound on compute for this figure and Figure 4 is 500 CPU hours and 55 GB of storage.
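
As a point of reference for panels f and h: with two actions, an ϵ-greedy agent that exploits the rewarding action and explores uniformly at random with probability ϵ = 0.3 earns reward at rate (1 − ϵ) + ϵ/2 = 0.85, matching the 85% used here, and its rolling-window entropy concentrates near −(0.85 log2 0.85 + 0.15 log2 0.15) ≈ 0.61 bits, with no exploitation peak at 0. The following is a minimal sketch of the rolling-window entropy computation and the ϵ-greedy baseline; the function names, the uniform exploration rule, and the moving-average implementation are our assumptions, not code from the paper.

    import numpy as np

    def rolling_entropy(actions, window=1000):
        """Shannon entropy (bits) of binary action choices in rolling windows.

        `actions` is a 1-D array of 0/1 action indices. For a two-action task,
        entropy ranges from 0 (pure exploitation of one action) to
        log2(2) = 1 bit (maximally random exploration).
        """
        actions = np.asarray(actions, dtype=float)
        # Fraction of each window spent on action 1, via a moving average.
        kernel = np.ones(window) / window
        p1 = np.convolve(actions, kernel, mode="valid")
        p0 = 1.0 - p1
        # Treat 0 * log2(0) as 0, the usual convention for entropy.
        with np.errstate(divide="ignore", invalid="ignore"):
            h = -(np.where(p0 > 0, p0 * np.log2(p0), 0.0)
                  + np.where(p1 > 0, p1 * np.log2(p1), 0.0))
        return h

    def epsilon_greedy_actions(n_steps, epsilon=0.3, rewarding_action=1, seed=None):
        """Baseline comparator: take the rewarding action, except explore
        uniformly at random with probability epsilon."""
        rng = np.random.default_rng(seed)
        explore = rng.random(n_steps) < epsilon
        random_choice = rng.integers(0, 2, size=n_steps)
        return np.where(explore, random_choice, rewarding_action)

    # Example: entropy trace for an epsilon-greedy agent over 250k timesteps.
    acts = epsilon_greedy_actions(250_000, epsilon=0.3, seed=0)
    h = rolling_entropy(acts, window=1000)
    print(h.mean())  # ~0.61 bits: consistently random exploration, unimodal

Applied to a PaN action trace instead of the ϵ-greedy one, the same entropy trace would be expected to alternate between values near 1 (exploration bouts) and values near 0 (exploitation bouts), producing the bimodal histogram in panel h.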