Action-value function estimation
Similar to how TD(0) and Monte Carlo value estimation could be applied to the case of function approximation, we can apply $n$-step SARSA and Q-learning in the semi-gradient setting.
Our goal is to learn a parametric approximation $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$ for on-policy control. Instead of performing a gradient step moving $\hat{v}(S_t, \mathbf{w})$ towards some target $U_t$, we now perform a gradient step on $\hat{q}(S_t, A_t, \mathbf{w})$ towards a target $U_t$. Hence, the gradient update step is of the form:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$
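To make the shape of this update concrete, here is a minimal sketch for a linear approximator $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$. The `features` function and how the target $U_t$ is produced are assumptions for illustration, not something fixed by the text.

```python
import numpy as np

def semi_gradient_step(w, features, s, a, target, alpha=0.1):
    """Generic semi-gradient update of w towards a target U_t.

    Assumes a linear approximator q_hat(s, a, w) = w . x(s, a),
    so grad_w q_hat(s, a, w) is simply the feature vector x(s, a).
    """
    x = features(s, a)            # feature vector x(s, a)
    q_hat = np.dot(w, x)          # current estimate q_hat(s, a, w)
    return w + alpha * (target - q_hat) * x
```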
Examples
In the case of (one-step) SARSA, this update is:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$
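A sketch of how the one-step SARSA target plugs into the generic update, reusing the hypothetical `features` and `semi_gradient_step` helpers from the previous snippet:

```python
import numpy as np

def sarsa_target(w, features, r, s_next, a_next, gamma=0.99, terminal=False):
    """One-step SARSA target: R_{t+1} + gamma * q_hat(S_{t+1}, A_{t+1}, w)."""
    if terminal:
        return r  # no bootstrap past a terminal state
    return r + gamma * np.dot(w, features(s_next, a_next))

# e.g. w = semi_gradient_step(w, features, s, a,
#                             sarsa_target(w, features, r, s_next, a_next))
```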
This algorithm is known as episodic semi-gradient one-step SARSA, and it would have the same convergence properties as TD(0) if the policy were constant.
There is also an analogue for $n$-step SARSA:

$$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})$$

with the $n$-step return

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}).$$
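A corresponding sketch of the $n$-step target, again under the same linear-approximator assumption; `rewards` is assumed to hold the $n$ rewards $R_{t+1}, \dots, R_{t+n}$:

```python
import numpy as np

def n_step_sarsa_target(w, features, rewards, s_n, a_n, gamma=0.99, terminal=False):
    """n-step return G_{t:t+n}: discounted sum of the n rewards
    R_{t+1}..R_{t+n}, plus a bootstrap from q_hat(S_{t+n}, A_{t+n}, w)."""
    G = sum(gamma**i * r for i, r in enumerate(rewards))
    if not terminal:
        G += gamma**len(rewards) * np.dot(w, features(s_n, a_n))
    return G
```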
As well as semi-gradient Q-learning:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$
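The Q-learning target only differs in the bootstrap term, which maximizes over actions; a sketch under the same assumptions, with `actions` a hypothetical list of the available actions:

```python
import numpy as np

def q_learning_target(w, features, r, s_next, actions, gamma=0.99, terminal=False):
    """Semi-gradient Q-learning target: R_{t+1} + gamma * max_a q_hat(S_{t+1}, a, w)."""
    if terminal:
        return r
    return r + gamma * max(np.dot(w, features(s_next, a)) for a in actions)
```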
Any of these methods may be used with any gradient-based optimizer, such as RMSProp, Adam, or vanilla mini-batch stochastic gradient descent.
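As a sketch of how this looks with an off-the-shelf optimizer, here is a PyTorch version, assuming a small network `q_net` that maps a state tensor to per-action values and a `target` that is already a tensor computed from one of the bootstrapped targets above:

```python
import torch

def semi_gradient_update(q_net, optimizer, state, action, target):
    """One semi-gradient step with a gradient-based optimizer (Adam, RMSProp, SGD, ...).

    The target is detached so gradients flow only through q_hat(S_t, A_t, w);
    treating the bootstrap target as a constant is what makes this "semi"-gradient.
    """
    q_sa = q_net(state)[action]              # q_hat(S_t, A_t, w)
    loss = (target.detach() - q_sa).pow(2)   # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# e.g. optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```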
Application to control problems
To apply these gradient action-value methods to control problems, simply combine them with a suitable exploration policy. For example, a commonly used exploration policy is an $\epsilon$-greedy policy with $\epsilon$ annealed linearly from a high value to a low value.
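A minimal sketch of such a linearly annealed $\epsilon$-greedy policy; the schedule values below are illustrative placeholders, not ones prescribed in the text:

```python
import numpy as np

def linear_epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=100_000):
    """Linearly anneal epsilon from eps_start down to eps_end over anneal_steps."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, step, rng=None):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < linear_epsilon(step):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```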
Remaining issues
We should still be wary of these function approximation methods for the reasons outlined in the “Challenges of Deep RL” chapter: training will be unstable due to the non-stationary distribution of experiences and the temporal correlations between them.
Fortunately, there are many solutions to the training instability problem for control. The first successful demonstration of learning control policies directly from high-dimensional state spaces was the deep Q-network (DQN), which learned from raw pixel values and reached or exceeded human expert performance on many of the 49 Atari games it was evaluated on.