Table 3.
| Policy | Avg. reward/turn | WIS score | DR estimate/turn | DR score |
|---|---|---|---|---|
| RL (T = 35) | 11.68 (2.06 to 21.35) | 408.29 (72.41 to 744.17) | 13.10 (12.91 to 13.35) | 458.64 (452.40 to 464.87) |
| Expert policy | 2.62 (−7.28 to 12.51) | 91.71 (−255.12 to 438.68) | 10.82 (10.51 to 11.14) | 379.07 (367.89 to 390.25) |
| Advantage | 8.68 (7.16 to 10.13) | 302.67 (250.78 to 354.58) | — | — |
Weighted importance sampling (WIS) provides an off-policy estimate of a given policy's value using trajectories sampled from the original dataset corpus [25]. For the expert policy, no importance weights are needed, and cumulative rewards are averaged over entire conversational episodes. For the AI agent, a cut-off of 35 turns is again used to bound the length of off-policy trajectories. Average reward per turn assesses the expected reward the agent earns at each turn, under the reward function used to train the RL agent.
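As a concrete illustration, the following is a minimal sketch of the standard WIS estimator over conversational episodes. The data layout (trajectories as lists of per-turn `(behavior_prob, target_prob, reward)` tuples) and all names are hypothetical, not taken from the paper; only the per-turn importance ratios, the 35-turn cut-off, and the weight-normalized average reflect the formulation described above.

```python
import numpy as np

def wis_estimate(trajectories, max_turns=35):
    """Weighted importance sampling (WIS) estimate of a target policy's
    cumulative reward from trajectories logged under a behavior policy.

    Each trajectory is a list of (behavior_prob, target_prob, reward)
    tuples, one per turn. Trajectories are truncated at `max_turns`
    (T = 35 in Table 3) to bound the variance of the importance weights.
    """
    weights, returns = [], []
    for traj in trajectories:
        traj = traj[:max_turns]
        # Per-trajectory importance weight: product of per-turn ratios
        # pi_target(a_t | s_t) / pi_behavior(a_t | s_t).
        ratio = np.prod([tp / bp for bp, tp, _ in traj])
        weights.append(ratio)
        returns.append(sum(r for _, _, r in traj))
    weights = np.asarray(weights)
    returns = np.asarray(returns)
    # WIS normalizes by the sum of weights rather than the number of
    # trajectories, trading a small bias for much lower variance.
    return np.sum(weights * returns) / np.sum(weights)
```

When the evaluated policy is the behavior (expert) policy itself, every per-turn ratio equals 1, so the estimator reduces to a plain average of cumulative episode rewards, consistent with the description above.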