Reinforcement Learning: Temporal Difference Learning
Imagine you are traveling from work to home, trying to predict how long it will take to get there. As you leave work, you consider the time of day, traffic conditions, weather conditions, etc., and constantly update your prediction of when you will reach home. As you do this, you are using temporal difference learning.
Here you will learn how temporal difference learning, often identified as the idea most central and novel to reinforcement learning, is used for online prediction.
Reinforcement learning algorithms are inspired by how organisms learn from experience to anticipate future rewards. The temporal difference error resembles the behavior of dopamine neurons, which encode the difference between the reward actually received and the reward that was expected.
Good to Know:
Reinforcement learning is a setting in which the learner or decision maker, called the Agent, interacts continually with its Environment by performing actions sequentially at each discrete time step. Each action changes the Environment's state, and in return the Agent receives a numerical reward from the Environment.
The goal of reinforcement learning is for an Agent to find an optimal policy that maximizes the long-term reward.
The Policy defines the behavior of an Agent in an Environment at a given time, based on the Agent's experience, and is refined using Generalized Policy Iteration (GPI). GPI alternates between policy evaluation and policy improvement steps to find an optimal policy.
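The alternation between evaluation and improvement can be sketched on a toy problem. Everything below is an illustrative assumption, not from the text: a deterministic two-state, two-action MDP with hand-picked rewards, evaluated by repeated Bellman backups and improved greedily until the policy stops changing.

```python
GAMMA = 0.9
# Hypothetical deterministic toy MDP (an assumption for illustration):
# TRANS[(state, action)] = (next_state, reward).
TRANS = {(0, 0): (0, 0.0), (0, 1): (1, 1.0),
         (1, 0): (0, 0.0), (1, 1): (1, 2.0)}

def evaluate(policy, V, sweeps=100):
    # Policy evaluation: repeatedly back up values under the fixed policy.
    for _ in range(sweeps):
        for s in (0, 1):
            s2, r = TRANS[(s, policy[s])]
            V[s] = r + GAMMA * V[s2]
    return V

def improve(V):
    # Policy improvement: act greedily with respect to the current values.
    return {s: max((0, 1),
                   key=lambda a: TRANS[(s, a)][1] + GAMMA * V[TRANS[(s, a)][0]])
            for s in (0, 1)}

def gpi():
    policy = {0: 0, 1: 0}
    V = {0: 0.0, 1: 0.0}
    while True:
        V = evaluate(policy, V)
        new_policy = improve(V)
        if new_policy == policy:  # policy stable -> optimal
            return policy, V
        policy = new_policy
```

Running `gpi()` on this toy MDP converges to the policy that always takes action 1, since that action yields the larger discounted return from both states.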
Temporal difference (TD) learning focuses on the policy evaluation, or prediction, problem.
Temporal difference (TD) learning derives its name from using the differences between predictions at successive time steps to predict the total amount of reward expected over the future.
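A minimal sketch of how such a prediction might be computed with tabular TD(0). The environment here is an assumption for illustration: a five-state random walk where an episode ends past either edge, with a reward of +1 only for exiting on the right. At each step the value estimate of the current state is nudged toward the bootstrapped target, reward plus discounted value of the next state.

```python
import random

ALPHA = 0.1    # step size
GAMMA = 1.0    # discount factor
N_STATES = 5   # non-terminal states 0..4 (assumed toy random walk)

def td0_prediction(n_episodes=1000, seed=0):
    rng = random.Random(seed)
    V = [0.0] * N_STATES  # value estimate per non-terminal state
    for _ in range(n_episodes):
        s = N_STATES // 2  # start each episode in the middle
        while True:
            s_next = s + rng.choice([-1, 1])
            # Reward: +1 for exiting on the right, 0 otherwise.
            if s_next < 0:
                reward, v_next, done = 0.0, 0.0, True
            elif s_next >= N_STATES:
                reward, v_next, done = 1.0, 0.0, True
            else:
                reward, v_next, done = 0.0, V[s_next], False
            # TD(0) update: the TD error is the difference between the
            # prediction at this step and the (reward-corrected)
            # prediction at the next step.
            td_error = reward + GAMMA * v_next - V[s]
            V[s] += ALPHA * td_error
            if done:
                break
            s = s_next
    return V
```

For this walk the true state values are 1/6, 2/6, 3/6, 4/6, 5/6, and the learned estimates approach them as episodes accumulate, without ever waiting for an episode's final outcome before updating.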