Reinforcement Learning: Monte Carlo Method
An easy-to-understand explanation of the Monte Carlo method for Reinforcement Learning
In Reinforcement Learning, the learner or decision maker, called the Agent, constantly interacts with its Environment by performing actions sequentially at discrete time steps. Each action changes the Environment's state, and in return the Agent receives a numerical reward from the Environment.
The sole objective of the Agent is to maximize the total reward it receives over the long run.
Over time, the Agent generates a trajectory: a sequence of states, actions, and rewards.
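The interaction loop described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical environment object with a Gym-style `reset()`/`step()` interface; the names are illustrative, not a specific library's API.

```python
def collect_trajectory(env, policy, max_steps=100):
    """Run one episode and return the trajectory as (state, action, reward) triples.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), a common RL convention.
    """
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                      # Agent chooses an action
        next_state, reward, done = env.step(action) # Environment responds
        trajectory.append((state, action, reward))  # record the experience
        if done:
            break
        state = next_state
    return trajectory
```

The trajectory is exactly the raw material Monte Carlo methods learn from: no transition model is consulted, only the recorded experience.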
A probability distribution P(s′ | s, a) gives the probability of moving from one state s to another state s′ when taking action a. The same quantity is often written as the transition probability T(s, a, s′): the probability of ending up in state s′ after taking action a in state s.
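To make the notation concrete, here is a toy transition model as a lookup table. The two-state weather example and the probabilities are invented for illustration only.

```python
import random

# Hypothetical transition model: P[(s, a)] maps each next state s' to P(s' | s, a).
P = {
    ("sunny", "stay"): {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "move"): {"sunny": 0.5, "rainy": 0.5},
    ("rainy", "stay"): {"rainy": 0.7, "sunny": 0.3},
    ("rainy", "move"): {"rainy": 0.4, "sunny": 0.6},
}

def sample_next_state(state, action):
    """Draw s' from the distribution P(s' | s, a)."""
    dist = P[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]

# Each P(. | s, a) is a proper distribution: probabilities sum to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```

Model-based methods assume a table like `P` is known; the rest of the article deals with the case where it is not.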
Model-based approaches, such as those built on the Markov Decision Process (MDP), use a model of the Environment. The model represents the Environment's dynamics with state-transition and reward functions.
If you want to predict the weather or the price of a stock, the prediction depends on a variety of environmental factors or market, economic, and sentiment factors. You simply don't know the transition probabilities of the future.
How can the Agent learn when the model of the Environment or the transition probabilities are unavailable?
When a model is unavailable, or there is no way to know how the system will evolve, the only way for the Agent to learn is by interacting with the Environment.
Monte Carlo does not assume complete knowledge of the Environment; it learns from experience gained by interacting with the Environment. Since it does not know the states ahead of time, it explicitly estimates the value of each action from sampled returns in order to find which actions yield the best rewards in the long run.
The primary goal of the Monte Carlo method is to estimate q*(s, a), the optimal action-value function, which gives the expected return of taking action a in state s. From this estimate, the Agent derives an optimal policy π* based on sampled episodes.
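A minimal sketch of the idea is first-visit Monte Carlo estimation of q(s, a): run many episodes, and for each state-action pair average the returns that follow its first visit in each episode. The tiny one-state episodic task below, its action names, and the random policy are all illustrative assumptions, not part of the article.

```python
import random
from collections import defaultdict

def run_episode(policy):
    """Generate one episode as a list of (state, action, reward) triples.

    Toy dynamics (hypothetical): from state "A", action "go" ends the
    episode with reward 1; action "wait" gives reward 0 and either ends
    the episode or returns to "A" with probability 0.5 each.
    """
    episode, state = [], "A"
    while state != "terminal":
        action = policy(state)
        if action == "go":
            episode.append((state, action, 1.0))
            state = "terminal"
        else:
            episode.append((state, action, 0.0))
            state = "terminal" if random.random() < 0.5 else "A"
    return episode

def first_visit_mc_q(num_episodes, policy, gamma=1.0):
    """Estimate q(s, a) by averaging returns from first visits to (s, a)."""
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = run_episode(policy)
        # Index of the first visit to each (s, a) pair in this episode.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        # Walk backwards, accumulating the discounted return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns[(s, a)].append(G)
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}

random.seed(0)
q = first_visit_mc_q(5000, policy=lambda s: random.choice(["go", "wait"]))
```

Here `q[("A", "go")]` converges to 1.0, since "go" always earns reward 1 and ends the episode; the estimate for "wait" settles near its true expected return. With these estimates in hand, an improved policy simply picks the action with the highest estimated value in each state.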