Reinforcement Learning: Temporal Difference Learning

Learn the idea most central to reinforcement learning algorithms

Renu Khandelwal
6 min read · Oct 3, 2022


Imagine you are traveling home from work and trying to predict how long the trip will take. As you leave work, you consider the time of day, traffic conditions, weather conditions, etc., and you keep updating your prediction of when you will reach home as new information arrives. Each time you revise the prediction before you actually arrive, you are using temporal difference learning.

Here you will learn how temporal difference learning, often identified as the one idea that is both central and novel to reinforcement learning, is used to make predictions online, that is, to update them step by step as experience arrives.

Reinforcement learning algorithms are inspired by how organisms learn from experience to anticipate future rewards. The temporal difference error resembles the behavior of dopamine neurons, which encode the difference between the reward received and the reward expected.

Good to Know:

Essential elements of Reinforcement Learning

Dynamic Programming

Generalized Policy Iteration

Reinforcement Learning: Monte Carlo Method

Reinforcement Learning: On Policy and Off Policy

In reinforcement learning, the learner or decision maker, called the Agent, interacts continually with its Environment by performing one action at each discrete time step. Each action changes the Environment’s state, and in response the Agent receives a numerical reward from the Environment.

Source: Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
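The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CorridorEnv class, its reset and step methods, and the run_episode helper are hypothetical names invented here, not part of any library or of the original article.

```python
import random


class CorridorEnv:
    """Hypothetical toy Environment: states 0..4, the episode ends at state 4."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        """Put the Agent back at the left end and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.state = max(0, self.state + action)  # cannot move left of state 0
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0             # reward only on reaching the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one episode: at each time step the Agent acts, the Environment responds."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(state)                    # Agent chooses an action
        state, reward, done = env.step(action)    # Environment changes state, emits reward
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def random_policy(state):
    """A policy that ignores the state and moves left or right at random."""
    return random.choice([-1, 1])
```

Calling run_episode(CorridorEnv(), random_policy) plays one episode with random behavior; swapping in a different policy function changes the Agent without touching the Environment.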

The goal of reinforcement learning is for an Agent to find an optimal policy that maximizes the long-term reward.
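One common way to make "long-term reward" precise, following the Sutton and Barto text cited above, is the discounted return. The discount factor γ is not mentioned in this article and is introduced here as an assumption; with γ = 1 the return is simply the sum of all future rewards.

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad 0 \le \gamma \le 1
```

Under this definition, an optimal policy is one that maximizes the expected value of G_t from every state.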

The Policy defines how the Agent behaves in the Environment at a given time. It is learned from the Agent’s experience using Generalized Policy Iteration (GPI), which alternates between Policy evaluation and Policy improvement steps to find an optimal policy.
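The alternation between evaluation and improvement can be sketched as follows. This is a minimal illustration, not the article's own code: it assumes a tiny, hypothetical 4-state chain whose dynamics are fully known, so evaluation is done with simple model-based backups, and the discount factor GAMMA = 0.9 is an arbitrary choice.

```python
# Hypothetical 4-state chain: actions 0 (left) and 1 (right).
# State 3 is terminal; entering it yields reward 1, every other move yields 0.
N_STATES, ACTIONS, GAMMA = 4, (0, 1), 0.9
TERMINAL = 3


def step(state, action):
    """Known deterministic model: return (next_state, reward)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == TERMINAL else 0.0)


def evaluate(policy, values, sweeps=50):
    """Policy evaluation: back up state values under the current policy."""
    for _ in range(sweeps):
        for s in range(TERMINAL):
            nxt, r = step(s, policy[s])
            values[s] = r + GAMMA * values[nxt]
    return values


def improve(policy, values):
    """Policy improvement: act greedily with respect to the current values.

    Returns True when no action changed, i.e. the policy is stable.
    """
    stable = True
    for s in range(TERMINAL):
        best = max(ACTIONS,
                   key=lambda a: step(s, a)[1] + GAMMA * values[step(s, a)[0]])
        if best != policy[s]:
            policy[s], stable = best, False
    return stable


def generalized_policy_iteration():
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = [0] * N_STATES   # start with a poor policy: always move left
    values = [0.0] * N_STATES
    while True:
        evaluate(policy, values)
        if improve(policy, values):
            return policy, values
```

Starting from "always move left", each round of evaluation exposes a better action in one more state, and improvement switches to it, until every non-terminal state moves right. Temporal difference learning replaces the model-based evaluate step with estimates learned from experience.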

Temporal difference (TD) learning focuses on the policy evaluation, or prediction, problem: estimating how much reward a given policy can expect to collect from each state.

Temporal difference (TD) learning derives its name from using the differences between predictions made at successive time steps to learn a measure of the total amount of reward expected over the future.
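To make the idea concrete, here is a minimal sketch of the simplest TD prediction method, usually called TD(0), which is not spelled out in the text above. After every step it nudges the value of the current state toward the reward plus the predicted value of the next state. The environment is an assumption chosen for illustration: a five-state random walk in the style of the Sutton and Barto textbook, with a reward of 1 only when the walk exits on the right. The function name and the parameters alpha (step size) and gamma (discount) are likewise hypothetical.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate state values of a 5-state random walk with TD(0).

    States 1..5 are non-terminal; states 0 and 6 are terminal.
    The policy being evaluated moves left or right with equal probability.
    """
    rng = random.Random(seed)
    n = 5
    V = [0.0] * (n + 2)            # V[0] and V[n + 1] are terminal and stay 0
    for _ in range(episodes):
        s = (n + 1) // 2           # every episode starts in the middle state
        while s not in (0, n + 1):
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == n + 1 else 0.0
            # Temporal difference: the newer prediction (reward + gamma * V[s_next])
            # minus the older prediction V[s].
            td_error = reward + gamma * V[s_next] - V[s]
            V[s] += alpha * td_error   # move the old prediction toward the new one
            s = s_next
    return V[1:n + 1]
```

Calling td0_random_walk() should return five estimates close to 1/6, 2/6, ..., 5/6, the true values for this walk; the exact numbers depend on the seed and the step size. Note that each update happens during the episode, without waiting for the final outcome, which is what makes the prediction online.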


