Off-Policy Monte Carlo with Importance Sampling Off Policy Learning Link to the Notebook By exploration-exploitation trade-off, the agent should take sub-optimal exploratory action by which the agent may receive less reward. One way of exploration is by using an epsilon-greedy policy, where the agent takes a nongreedy action with a small probability.
In an on-policy, improvement and evaluation are done on the policy which is used to select actions.
In off-policy, improvement and evaluation are done on a different policy from the one used to select actions.