Off-Policy Monte Carlo with Importance Sampling
Off-Policy Learning
Link to the Notebook
Because of the exploration-exploitation trade-off, the agent must sometimes take sub-optimal exploratory actions, for which it may receive less reward. One way to explore is to use an epsilon-greedy policy, where the agent takes a non-greedy action with a small probability.
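To make the epsilon-greedy idea concrete, here is a minimal sketch in Python. The function name `epsilon_greedy_action` and the representation of action values as a plain list are assumptions made for illustration, not code taken from the notebook.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of action-value estimates.

    With probability epsilon the action is drawn uniformly at random
    (exploration); otherwise the action with the highest estimated
    value is chosen (exploitation).
    """
    if rng.random() < epsilon:
        # Explore: every action, including the greedy one, is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest value (first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Note that the random branch can still return the greedy action, so with n actions a non-greedy action is selected with probability epsilon * (n - 1) / n.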
In on-policy learning, evaluation and improvement are performed on the same policy that is used to select actions.
In off-policy learning, evaluation and improvement are performed on a policy (the target policy) that is different from the one used to select actions (the behavior policy).
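Since the episodes come from the policy that selects actions (often called the behavior policy) while the values are wanted for the policy being evaluated (the target policy), the returns have to be reweighted. The sketch below shows one common way to do this, incremental weighted importance sampling over a single episode. It is an illustration under stated assumptions rather than the notebook's implementation: the names `weighted_is_update`, `target_prob`, and `behavior_prob` are hypothetical, and an episode is assumed to be a list of `(state, action, reward)` tuples.

```python
from collections import defaultdict


def weighted_is_update(Q, C, episode, target_prob, behavior_prob, gamma=1.0):
    """Update action-value estimates Q from one episode of the behavior policy.

    Q, C          : dicts keyed by (state, action); C holds cumulative weights.
    episode       : list of (state, action, reward) tuples in time order.
    target_prob   : function (state, action) -> probability under the target policy.
    behavior_prob : function (state, action) -> probability under the behavior
                    policy (assumed to be > 0 for every action in the episode).
    """
    G = 0.0  # return accumulated from the end of the episode
    W = 1.0  # importance sampling ratio for the steps after the current one
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        C[(state, action)] += W
        # Weighted average of returns, with weights given by W.
        Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
        W *= target_prob(state, action) / behavior_prob(state, action)
        if W == 0.0:
            # The target policy would never take this action, so earlier
            # steps of the episode carry no information about it.
            break
    return Q


# Hypothetical usage: tables that default to zero for unseen pairs.
Q = defaultdict(float)
C = defaultdict(float)
```

Processing the episode backwards lets the return and the importance ratio be built up incrementally, and the early exit skips the part of the episode that the target policy could not have produced.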
Deep Reinforcement Learning Agent for Navigator Environment
Policy based RL methods for Lunar lander and Cartpole environments.
TD methods like SARSA(0), SARSAMax and Expected SARSA.
MC method for BlackJack environment.