2020 Sep 14;11(40):10959–10972. doi: 10.1039/d0sc04184j

Fig. 2. The bootstrapping process and the reinforcement learning process used to train a value network. (a) Bottom-up propagation of the z(s) value for bootstrapping. If the route does not start from buyable precursors, the z value of every non-buyable compound is zero (left). If the route does start from buyable precursors, then beginning with the buyable leaf precursors (z = 1), the z value of each compound in the tree is assigned as the average z value of the compound's immediate precursors times a discount factor γ (0 < γ < 1). If another route under the same compound yields a higher z value than the current route, the compound's z value is updated to the larger value. Here circles denote compounds: blue circles are compounds that are not commercially available, and green circles are buyable compounds. Triangles denote the templates through which compounds are transformed into their corresponding precursors. (b) The RL process used to train the value network. With the z values sampled from MCTS as in (a), a value network can be trained so that vθ(s) approximates z(s).
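The bottom-up propagation in panel (a) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the route representation (nested tuples), the discount value, and the rule that any zero-valued branch zeroes the whole route are assumptions based on the caption's description.

```python
GAMMA = 0.9  # discount factor, 0 < gamma < 1 (value chosen for illustration)

def z_value(node, buyable):
    """Return the z value of a compound node in one synthesis route.

    node: (compound, [precursor subtrees]); a leaf has an empty subtree list.
    buyable: set of commercially available compounds.
    """
    compound, precursors = node
    if not precursors:
        # Leaf: z = 1 for a buyable precursor, 0 otherwise.
        return 1.0 if compound in buyable else 0.0
    child_z = [z_value(p, buyable) for p in precursors]
    if any(z == 0.0 for z in child_z):
        # Route is not built from buyable precursors: z = 0 (left panel).
        return 0.0
    # Average z of immediate precursors, discounted by gamma.
    return GAMMA * sum(child_z) / len(child_z)

def best_z(routes, buyable):
    """When several routes end at the same compound, keep the larger z."""
    return max(z_value(r, buyable) for r in routes)
```

For example, a target made in one step from two buyable precursors gets z = γ · (1 + 1)/2 = 0.9, while a route containing any non-buyable leaf contributes z = 0, so `best_z` returns the value of the best available route.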
