Reinforcement Learning (RL) has evolved considerably. RL agents have defeated world-champion Go players, controlled robotic hands, and even painted pictures.
Yet RL still struggles with the problem of value estimation. What is value estimation in RL? It is the task of learning the long-term consequences of being in a particular state. Learning this can be hard because the returns an agent observes are usually noisy, so a good value estimator must average that noise away.
The most natural way to estimate the value of a state is to average the returns observed from it. This method is called Monte Carlo value estimation.
Monte Carlo Value Estimation
By definition, the nth Monte Carlo update is V(st) ← V(st) + 1/n [Rn − V(st)], which is just a running average of the returns observed from st. In its simplest form, the update moves V(st) toward the return:
- V(st) ← Rt
Here V(st) is the estimated value of state st and Rn is the nth return observed from it; the right-hand side, Rt, is the return of the trajectory that passed through st. The return is a discounted sum of future rewards: Rt = rt + γrt+1 + γ²rt+2 + …
Where γ is a discount factor that weights short-term rewards against long-term rewards.
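To make the Monte Carlo update concrete, here is a minimal sketch in Python of tabular every-visit MC value estimation. The episode format (a list of (state, reward) pairs), the state names, and the reward values are illustrative assumptions, not taken from the original article.

```python
from collections import defaultdict

def mc_update(V, counts, episode, gamma=0.9):
    """Every-visit Monte Carlo: V(s) <- V(s) + 1/n * (R - V(s)).

    `episode` is a list of (state, reward) pairs from one finished trajectory,
    where `reward` is the reward received on leaving that state.
    """
    G = 0.0
    # Walk the trajectory backwards so G accumulates r_t + gamma*r_{t+1} + ...
    for state, reward in reversed(episode):
        G = reward + gamma * G
        counts[state] += 1
        V[state] += (G - V[state]) / counts[state]  # incremental running average
    return V

# Illustrative usage on a made-up 3-step trajectory A -> B -> C.
V, counts = defaultdict(float), defaultdict(int)
mc_update(V, counts, [("A", 0.0), ("B", 0.0), ("C", 1.0)])
print(dict(V))  # roughly {'C': 1.0, 'B': 0.9, 'A': 0.81}
```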
Beating Monte Carlo
However, the Monte Carlo method can be improved upon with a technique called Temporal Difference (TD) learning, which bootstraps off the value estimates of nearby states and compresses the value update into a single step:
V(st) ← rt + γV(st+1)
In contrast to Monte Carlo updates, TD updates merge at intersections between trajectories, so that reward information flows backwards to all preceding states.
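For comparison, here is a similarly minimal sketch of the tabular TD(0) update under the same assumed episode format. The step size `alpha` is an extra parameter not discussed in the text above.

```python
from collections import defaultdict

def td0_update(V, episode, gamma=0.9, alpha=0.1):
    """TD(0): V(s_t) <- V(s_t) + alpha * (r_t + gamma*V(s_{t+1}) - V(s_t)).

    Instead of waiting for the full return, each state bootstraps off the
    current value estimate of its successor state.
    """
    for i, (state, reward) in enumerate(episode):
        is_last = i + 1 == len(episode)
        bootstrap = 0.0 if is_last else V[episode[i + 1][0]]  # terminal states are worth 0
        td_target = reward + gamma * bootstrap
        V[state] += alpha * (td_target - V[state])
    return V

# Same made-up trajectory as before.
V = defaultdict(float)
td0_update(V, [("A", 0.0), ("B", 0.0), ("C", 1.0)])
print(dict(V))  # only V('C') moves on the first pass; earlier states catch up over repeated passes
```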
V(st+1) can be expressed as the expectation over all of its TD updates:
V(st+1) ≃ E[rt+1′ + γV(st+2′)]
≃ E[rt+1′] + γE[V(st+2′)]
We can use this expression to expand the TD update rule recursively:
V(st) ← rt + γV(st+1)
     ← rt + γE[rt+1′] + γ²E[V(st+2′)]
     ← rt + γE[rt+1′] + γ²E E[rt+2′′] + …
What we end up with is a strange-looking sum of nested expectation values. At first glance, it is not obvious how to compare it with the much simpler-looking MC update, or whether comparing TD and MC like this even makes sense.
However, the two update rules are not as different as they seem. If we write the MC update in terms of individual rewards and place it beside the expanded TD update, the relationship becomes clear:
- MC update
V(st) ← rt (reward from the current path) + γrt+1 (reward from the current path) + γ²rt+2 (reward from the current path) + …
- TD update
V(st) ← rt (reward from the current path) + γE[rt+1′] (expectation over paths intersecting the current path) + γ²E E[rt+2′′] (expectation over paths intersecting the current path) + …
The only difference between the MC update and the TD update comes down to the nested expectation operators. This observation is what is called the paths perspective on value learning.
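The sketch below illustrates that difference on two hypothetical trajectories that intersect at a shared state "X". Under MC, the reward seen on the first trajectory never reaches state "B"; under TD it leaks backwards through the intersection. All states and rewards here are made up for illustration.

```python
from collections import defaultdict

# Two hypothetical trajectories that intersect at state "X".
# Format: (state, reward received on leaving that state); both end after "X".
traj_1 = [("A", 0.0), ("X", 1.0)]   # X paid off this time
traj_2 = [("B", 0.0), ("X", 0.0)]   # X paid nothing this time

gamma, alpha = 0.9, 0.1

# Monte Carlo: each trajectory only informs the states it actually visited.
V_mc, counts = defaultdict(float), defaultdict(int)
for traj in (traj_1, traj_2):
    G = 0.0
    for state, reward in reversed(traj):
        G = reward + gamma * G
        counts[state] += 1
        V_mc[state] += (G - V_mc[state]) / counts[state]

# TD(0): bootstrapping lets traj_1's reward leak into V('B') via the shared state "X".
V_td = defaultdict(float)
for traj in (traj_1, traj_2):
    for i, (state, reward) in enumerate(traj):
        bootstrap = V_td[traj[i + 1][0]] if i + 1 < len(traj) else 0.0
        V_td[state] += alpha * (reward + gamma * bootstrap - V_td[state])

print("MC :", dict(V_mc))  # V('B') == 0: trajectory 2 never saw any reward
print("TD :", dict(V_td))  # V('B') > 0: reward flowed backwards through "X"
```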
The Paths Perspective
It is common to describe an agent's experience as a series of trajectories. This grouping is logical and easy to visualise, but it hides the relationships between different trajectories, such as the states where they intersect. Those relationships are exactly what matters when we estimate values.
Value estimation – MC averages over the real trajectories the agent experienced, while TD learning averages over all possible paths through that experience. The nested expectation values we saw earlier correspond to the agent averaging across all possible future paths.
Contrasting the two – Generally, the best value estimate is the one with the lowest variance. Since tabular TD and MC are both empirical averages, the better estimator is the one that averages over more items. But which estimator averages over more items? The variance of an empirical average shrinks with the number of items it contains:
Var[V(s)] ∝ 1/N (the variance of the estimate is inversely proportional to the number of items, N, in the average)
Paths generally outnumber trajectories, since every trajectory is itself a path and every intersection creates additional ones. This explains why TD tends to outperform MC in tabular environments.
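A quick simulation, under the assumption that returns from a state behave like i.i.d. Gaussian noise, shows the 1/N scaling directly:

```python
import random
import statistics

def mean_of_returns(n, noise_std=1.0, true_value=0.0):
    """Average n noisy returns for a single state (returns modelled as Gaussian noise)."""
    return statistics.fmean(random.gauss(true_value, noise_std) for _ in range(n))

# Measure the variance of the estimate over many repeated experiments.
random.seed(0)
for n in (1, 4, 16, 64):
    estimates = [mean_of_returns(n) for _ in range(20_000)]
    print(f"N = {n:3d}  Var[V(s)] ~ {statistics.variance(estimates):.4f}")
# Expect roughly 1.0, 0.25, 0.0625, 0.0156: the variance falls off as 1/N,
# so whichever estimator averages over more items wins.
```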
Implications of the Paths Perspective for Deep Reinforcement Learning
Deep neural networks are the most popular function approximators in RL. Early in training, however, they tend to merge the wrong paths of experience: in the Cliff Walking experiment, an untrained neural network can make the same bad value updates as a Euclidean averager, blending together states that lie on opposite sides of the barrier.
As training progresses, though, a fully-trained neural network learns to keep those regions separate, so that value updates to states above the barrier no longer leak across it and the same errors are not repeated. This is something most other function approximators cannot do.
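As a rough illustration of how a deep value function is trained in this setting, here is a minimal sketch that regresses a small network toward TD targets. It assumes PyTorch; the 2-D state encoding, the network size, and the random batch of transitions are placeholders, not the setup of the Cliff Walking experiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small value network over a made-up 2-D state encoding.
value_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

def td_step(states, rewards, next_states, dones):
    """One gradient step toward the TD target r + gamma * V(s') on a batch of transitions."""
    with torch.no_grad():  # do not backpropagate through the bootstrap target
        targets = rewards + gamma * value_net(next_states).squeeze(-1) * (1.0 - dones)
    loss = F.mse_loss(value_net(states).squeeze(-1), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative call on random placeholder data (batch of 32 transitions).
states, next_states = torch.randn(32, 2), torch.randn(32, 2)
rewards, dones = torch.randn(32), torch.zeros(32)
print(td_step(states, rewards, next_states, dones))
```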
Final Words
Hopefully, this write-up has shed some light on the paths perspective on value learning and helped you understand the concept better. If you want to learn more, you can always visit the blog section of the E2E Networks website.
Reference Links
https://distill.pub/2019/paths-perspective-on-value-learning/
https://greydanus.github.io/2020/01/27/paths-perspective/
https://arxiv.org/abs/2101.10552