Reinforcement Learning: On Policy and Off Policy
An intuitive explanation of On Policy and Off Policy, and the differences between them
--
The explanations in this article are deliberately simplified to make the concepts easier to understand.
You just moved to a new locality and have tried a few restaurants in your area. Today you are going out to eat again at a restaurant.
Let's frame the problem of choosing the best restaurant as a Reinforcement Learning problem.
You, the Agent (the decision maker), are constantly trying to find the best restaurant experience in your area, which plays the role of the Environment. You act by visiting a restaurant at each time step. Depending on which restaurant you visit, the Environment moves to a new state, here the restaurant experience you end up with. As a result, you receive a numerical reward from the Environment that reflects how good or bad that experience was.
The sole objective of the Agent, which is you, is to maximize the total reward to get the best restaurant experience in your area over the long run.
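The loop described above can be sketched in a few lines of Python. This is only an illustration: the restaurant names, their average scores, and the helper names `visit` and `run_episode` are all made up, and the agent here picks restaurants at random rather than following any learned strategy.

```python
import random

# Hypothetical restaurants and their average experience scores (invented for illustration).
RESTAURANTS = {"thai": 0.8, "pizza": 0.6, "diner": 0.3}


def visit(restaurant, rng):
    """Environment step: return a noisy numerical reward for one visit."""
    return RESTAURANTS[restaurant] + rng.gauss(0, 0.1)


def run_episode(num_steps, seed=0):
    """Agent-environment loop: act, receive a reward, accumulate the total."""
    rng = random.Random(seed)
    total_reward = 0.0
    history = []
    for t in range(num_steps):
        action = rng.choice(list(RESTAURANTS))  # placeholder behavior: pick at random
        reward = visit(action, rng)             # the Environment responds with a reward
        total_reward += reward
        history.append((t, action, reward))
    return total_reward, history


total, history = run_episode(num_steps=10)
print(f"total reward after {len(history)} visits: {total:.2f}")
```

The quantity the Agent cares about is `total_reward`, the sum of rewards over all time steps, not the reward of any single visit.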
A policy is a strategy an agent deploys in pursuit of a goal.
The Policy maps each state of the environment to the action the agent takes; a good Policy is one that maximizes the agent's long-term reward.
A policy is optimal if its expected long-term reward is greater than or equal to that of every other policy in every state.
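In standard notation, this condition is usually written with a state-value function. The symbols below are not from this article: $v_\pi(s)$ is the expected long-term reward when starting in state $s$ and following policy $\pi$, $R_t$ is the reward at time step $t$, and the discount factor $\gamma \in [0, 1]$ is an added assumption that weights future rewards.

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right]

\pi^{*} \text{ is optimal} \iff v_{\pi^{*}}(s) \ge v_{\pi}(s) \quad \text{for all states } s \text{ and all policies } \pi
```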
In the case of finding the best restaurant experience in your area, the way you choose which restaurant to visit is your Policy, or strategy.
The Agent's policy changes with experience as it explores and exploits the Environment.
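One common way to illustrate a policy that changes with experience is an epsilon-greedy rule, which is a technique brought in here for illustration and not something this article prescribes. With a small probability `epsilon` the agent explores by picking a restaurant at random; otherwise it exploits by picking the restaurant with the highest estimated reward so far. The restaurant names, scores, and function names below are hypothetical.

```python
import random

# Hypothetical average experience per restaurant (unknown to the agent).
TRUE_MEANS = {"thai": 0.8, "pizza": 0.6, "diner": 0.3}


def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))
    return max(estimates, key=estimates.get)


def learn(num_visits=500, epsilon=0.1, seed=0):
    """Visit restaurants repeatedly and keep a running average reward for each."""
    rng = random.Random(seed)
    estimates = {name: 0.0 for name in TRUE_MEANS}
    counts = {name: 0 for name in TRUE_MEANS}
    for _ in range(num_visits):
        choice = epsilon_greedy(estimates, epsilon, rng)
        reward = TRUE_MEANS[choice] + rng.gauss(0, 0.1)  # noisy experience
        counts[choice] += 1
        # Incremental mean: move the estimate toward the newly observed reward.
        estimates[choice] += (reward - estimates[choice]) / counts[choice]
    return estimates, counts


estimates, counts = learn()
print(estimates)
print(counts)
```

Because the estimates are updated after every visit, the restaurant that `epsilon_greedy` prefers can change over time, which is the sense in which the policy changes with experience.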
You can either go to your favorite restaurant based on past experience or try a new one. When you choose your favorite restaurant based on history, you exploit the best information available. In contrast, when you are adventurous, try a new restaurant, and gather…