Generalized Policy Iteration

Learn about Generalized Policy Iteration, Value Iteration, and Policy Iteration

Renu Khandelwal


In reinforcement learning, the learner or decision maker, called the Agent, interacts continually with its Environment by performing one action at each discrete time step. Each action changes the Environment’s state, and in response the Agent receives a numerical reward from the Environment.
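The interaction loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CorridorEnv` class, its `reset`/`step` methods, and the five-cell corridor are hypothetical and are not taken from the book or from any library.

```python
class CorridorEnv:
    """Hypothetical 5-cell corridor; the goal is the rightmost cell."""

    def __init__(self, n_cells=5):
        self.n_cells = n_cells
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(max(self.state + action, 0), self.n_cells - 1)
        done = self.state == self.n_cells - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Run one episode and return (total reward, number of steps taken)."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = choose_action(state)            # the Agent acts
        state, reward, done = env.step(action)   # the Environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


# An Agent that always moves right reaches the goal in four steps.
print(run_episode(CorridorEnv(), lambda state: +1))  # (1.0, 4)
```

Each pass through the loop is one discrete time step: the Agent picks an action, the Environment moves to a new state and hands back a reward.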

Source: Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto

The goal of reinforcement learning is for the Agent to find an optimal policy, one that maximizes the long-term cumulative reward. The Environment rewards desired actions in particular states and penalizes undesired ones.

A policy is the Agent’s strategy for collecting reward from the Environment; it defines how the Agent behaves at a given time.

The Agent uses the policy to decide what action to perform when the Environment is in a specific state. The policy is like a map for the Agent in an Environment to reach the desired goal.
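In code, a policy can be as simple as a lookup table from states to actions. The sketch below is illustrative only: the state names, action names, and the `act` helper are made up for this example. It shows a deterministic policy (one action per state) and a stochastic one (a probability distribution over actions per state).

```python
import random

# Deterministic policy: exactly one action per state.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "stay"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
    "s2": {"stay": 1.0},
}


def act(policy, state, rng=random):
    """Return the action the policy prescribes (or samples) in `state`."""
    choice = policy[state]
    if isinstance(choice, dict):
        actions, probs = zip(*choice.items())
        return rng.choices(actions, weights=probs, k=1)[0]
    return choice


print(act(deterministic_policy, "s0"))  # right
print(act(stochastic_policy, "s2"))     # stay
```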

The Agent’s policy changes with experience as it explores and exploits the Environment.

At any given moment, the current policy is not necessarily the optimal route to the desired end state.

To find the optimal policy, you need to find the actions that yield the highest long-term reward.

A policy is optimal if its expected long-term reward is greater than or equal to that of every other policy in every state.
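Written as a formula, the condition looks like this. The notation is an assumption borrowed from common textbook usage, not from this article: V_π(s) stands for the expected long-term reward when the Agent starts in state s and follows policy π, and π* denotes an optimal policy.

```latex
\pi_* \text{ is optimal} \iff V_{\pi_*}(s) \ge V_{\pi}(s)
\quad \text{for every state } s \text{ and every policy } \pi
```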

The expected reward an Agent collects when it follows a policy π, as the Environment transitions from state to state, is computed using the value function (V).

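The value function (V) estimates how good it is, in the long run, for the Agent to be in a given state while following the policy. Its action-value counterpart, usually written Q, estimates how good it is for the Agent to perform a given action in a particular state.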
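As a rough illustration, the value of each state under a fixed policy can be computed by sweeping over the states repeatedly until the estimates stop changing (often called iterative policy evaluation). Everything in this sketch is an assumption made for the example: a four-cell corridor with a reward of 1 for reaching the last cell, a discount factor `GAMMA` of 0.9, and the helper names `step`, `evaluate_policy`, and `action_value`.

```python
GAMMA = 0.9                 # assumed discount factor
N_STATES, GOAL = 4, 3       # hypothetical corridor: states 0..3, goal at 3
ACTIONS = (-1, +1)          # move left or right


def step(state, action):
    """Deterministic model of the corridor: return (next_state, reward)."""
    if state == GOAL:
        return state, 0.0   # terminal state: absorbing, no further reward
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def evaluate_policy(policy, theta=1e-10):
    """Estimate V for a fixed policy by repeated sweeps over all states."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            nxt, reward = step(s, policy[s])
            new_v = reward + GAMMA * V[nxt]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:   # stop once no estimate moved noticeably
            return V


def action_value(V, state, action):
    """How good is taking `action` in `state`, then following the policy?"""
    nxt, reward = step(state, action)
    return reward + GAMMA * V[nxt]


always_right = [+1] * N_STATES
V = evaluate_policy(always_right)
print([round(v, 2) for v in V])  # [0.81, 0.9, 1.0, 0.0]
```

States closer to the goal have higher values because the reward is discounted less by the time the Agent reaches it.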

The Agent uses the value function to judge how good it is to follow the current policy and whether changing to a new policy would be better or worse.
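One way to picture that comparison: given value estimates for the current policy, check in each state whether some other action looks strictly better, and switch if it does. The sketch below assumes a hypothetical four-cell corridor with a reward of 1 at the last cell and a discount factor of 0.9; `step` and `improve_policy` are illustrative names, and the value estimates are simply passed in.

```python
GAMMA = 0.9                 # assumed discount factor
N_STATES, GOAL = 4, 3       # hypothetical corridor: states 0..3, goal at 3
ACTIONS = (-1, +1)          # move left or right


def step(state, action):
    """Deterministic model of the corridor: return (next_state, reward)."""
    if state == GOAL:
        return state, 0.0
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def improve_policy(V, policy):
    """Return (new_policy, changed) by acting greedily with respect to V."""
    new_policy = list(policy)
    for s in range(N_STATES):
        if s == GOAL:
            continue        # nothing to decide in the terminal state

        def q(action):
            nxt, reward = step(s, action)
            return reward + GAMMA * V[nxt]

        best = max(ACTIONS, key=q)
        # Keep the current action unless another one is strictly better.
        if q(best) > q(policy[s]):
            new_policy[s] = best
    return new_policy, new_policy != list(policy)


# Under "always left" every state is worth 0, yet stepping right from
# state 2 reaches the goal, so the policy changes there.
print(improve_policy([0.0, 0.0, 0.0, 0.0], [-1, -1, -1, -1]))
# ([-1, -1, 1, -1], True)
```

When no state offers a strictly better action, `changed` comes back `False` and the current policy is kept.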

An optimal policy can be obtained using Generalized Policy Iteration (GPI), a general scheme that includes algorithms such as Value Iteration and Policy Iteration.
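To make the two algorithms concrete, here is a minimal sketch of both on a toy problem. It is not a reference implementation: the four-cell corridor, the reward of 1 at the last cell, the discount factor of 0.9, and all function names are assumptions made for this example. Policy Iteration alternates a full evaluation of the current policy with a greedy improvement step; Value Iteration folds the two together by backing up the best action value in every sweep.

```python
GAMMA = 0.9                 # assumed discount factor
N_STATES, GOAL = 4, 3       # hypothetical corridor: states 0..3, goal at 3
ACTIONS = (-1, +1)          # move left or right


def step(state, action):
    """Deterministic model of the corridor: return (next_state, reward)."""
    if state == GOAL:
        return state, 0.0
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def q_value(V, s, a):
    nxt, reward = step(s, a)
    return reward + GAMMA * V[nxt]


def greedy_policy(V):
    """Pick the best-looking action in every state (ties go to the first)."""
    return [max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in range(N_STATES)]


def policy_iteration(theta=1e-10):
    policy = [-1] * N_STATES            # start from "always left"
    while True:
        # 1) Policy evaluation: sweep until V settles for this policy.
        V = [0.0] * N_STATES
        while True:
            delta = 0.0
            for s in range(N_STATES):
                new_v = q_value(V, s, policy[s])
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < theta:
                break
        # 2) Policy improvement: act greedily with respect to V.
        new_policy = greedy_policy(V)
        if new_policy == policy:        # stable policy: stop
            return policy, V
        policy = new_policy


def value_iteration(theta=1e-10):
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            # Evaluation and improvement in one step: back up the best action.
            new_v = max(q_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return greedy_policy(V), V


pi_policy, pi_values = policy_iteration()
vi_policy, vi_values = value_iteration()
print(pi_policy[:GOAL], vi_policy[:GOAL])   # [1, 1, 1] [1, 1, 1]
print([round(v, 2) for v in vi_values])     # [0.81, 0.9, 1.0, 0.0]
```

On this toy problem both routes end at the same policy (always move right) and the same values, which is the sense in which they are two instances of the same general idea.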


