Reinforcement Learning: On Policy and Off Policy

An intuitive explanation of the terms used for On Policy and Off Policy, along with their differences

Renu Khandelwal
6 min read · Sep 29, 2022
Image by author

The explanations in this article are deliberately simplified to make the concepts easier to understand.

You have just moved to a new neighborhood and have tried a few restaurants in the area. Today you are going out to eat again.

Let's frame the problem of choosing the best restaurant as a Reinforcement Learning problem.

You are the Agent, the decision maker, constantly trying to find the best restaurant experience in your area. The area and its restaurants form the Environment. At each time step you take an action by visiting a restaurant. Depending on which restaurant you visit, the Environment moves to a new state, which here is the restaurant experience, and gives you a numerical reward that reflects how good or bad that experience was.
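The loop described above can be sketched in a few lines of Python. This is only an illustration: the restaurant names, the probabilities in `MEAN_REWARD`, and the functions `step` and `run_episode` are all made up for this example, and the Agent here simply picks a restaurant at random instead of following any learned policy.

```python
import random

# Hypothetical restaurants and the assumed chance of a good experience at each.
MEAN_REWARD = {"thai": 0.8, "pizza": 0.5, "diner": 0.2}


def step(action, rng):
    """Environment: given the restaurant visited, return (next_state, reward)."""
    reward = 1.0 if rng.random() < MEAN_REWARD[action] else 0.0
    next_state = "good experience" if reward == 1.0 else "bad experience"
    return next_state, reward


def run_episode(num_steps=5, seed=0):
    """Agent-Environment loop: one restaurant visit per time step."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    history = []
    for t in range(num_steps):
        # Action: the Agent picks a restaurant (at random in this sketch).
        action = rng.choice(list(MEAN_REWARD))
        # The Environment responds with a new state and a numerical reward.
        state, reward = step(action, rng)
        history.append((t, action, state, reward))
    return history
```

Calling `run_episode()` returns one tuple per time step: the step index, the restaurant visited, the resulting experience, and the reward received.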

Your sole objective as the Agent is to maximize the total reward, that is, to get the best restaurant experience in your area over the long run.
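To make "total reward" concrete, here is a small sketch that adds up the rewards from a handful of visits and averages them per restaurant. The visit log and the helper names `total_reward` and `average_reward_per_restaurant` are invented for illustration, with a reward of 1.0 standing for a good experience and 0.0 for a bad one.

```python
from collections import defaultdict


def total_reward(visits):
    """Sum of the rewards collected over all visits."""
    return sum(reward for _, reward in visits)


def average_reward_per_restaurant(visits):
    """Average observed reward for each restaurant that was visited."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for restaurant, reward in visits:
        totals[restaurant] += reward
        counts[restaurant] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Hypothetical log of (restaurant, reward) pairs, one per time step.
visits = [
    ("thai", 1.0),
    ("pizza", 0.0),
    ("thai", 1.0),
    ("diner", 0.0),
    ("pizza", 1.0),
]
```

With this log, `total_reward(visits)` is 3.0, and the per-restaurant averages are 1.0 for "thai", 0.5 for "pizza", and 0.0 for "diner". An Agent that wants a higher total over the long run would tend to favor the restaurants with the higher averages.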
