Action-value function estimation

Just as TD(0) and Monte Carlo value estimation extend to function approximation, $n$-step SARSA and Q-learning can be applied in the gradient-based setting.

Our goal is to learn a parametric approximation $\hat{q}(s, a | \theta) \approx q_*(s, a)$ for on-policy control. Instead of performing a gradient step that moves the estimate for $S_t$ towards some target $U_t$, we now perform a gradient step that moves the estimate for the pair $(S_t, A_t)$ towards a target $U_t$. Hence, the gradient update step is of the form:

$$ \theta_{t+1} \gets \theta_t + \alpha \left[ U_t - \hat{q}(S_t, A_t | \theta_t) \right] \nabla \hat{q}(S_t, A_t | \theta_t) $$

Examples

In the case of (one-step) SARSA, this update is:

$$ \theta_{t+1} \gets \theta_t + \alpha \left[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1} | \theta_t) - \hat{q}(S_t, A_t | \theta_t) \right] \nabla \hat{q}(S_t, A_t | \theta_t) $$

This algorithm is known as episodic semi-gradient one-step SARSA, and it would have the same convergence properties as TD(0) if the policy were constant.
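To make the one-step SARSA update concrete, here is a minimal sketch that assumes a linear approximator $\hat{q}(s, a | \theta) = \theta^\top x(s, a)$, so that the gradient is simply the feature vector. The feature vectors, the function names `q_hat` and `sarsa_update`, and the values of `alpha` and `gamma` are illustrative assumptions, not part of the method itself.

```python
import numpy as np

def q_hat(theta, features):
    """Linear action-value estimate: theta . x(s, a) (assumed model)."""
    return float(theta @ features)

def sarsa_update(theta, x_t, reward, x_next, alpha=0.1, gamma=0.99, terminal=False):
    """One semi-gradient SARSA step for a linear q_hat.

    x_t and x_next are hypothetical feature vectors for (S_t, A_t) and
    (S_{t+1}, A_{t+1}). For a linear model the gradient of q_hat with
    respect to theta is just x_t.
    """
    # The bootstrapped target is treated as a constant: no gradient flows
    # through q_hat(S_{t+1}, A_{t+1}), which is why this is "semi-gradient".
    target = reward if terminal else reward + gamma * q_hat(theta, x_next)
    td_error = target - q_hat(theta, x_t)
    return theta + alpha * td_error * x_t

# Toy usage with made-up one-hot features.
theta = np.zeros(3)
x_t = np.array([1.0, 0.0, 0.0])
x_next = np.array([0.0, 1.0, 0.0])
theta = sarsa_update(theta, x_t, reward=1.0, x_next=x_next)
print(theta)
```

With a nonlinear approximator such as a neural network, the only change in this sketch would be that the gradient comes from automatic differentiation rather than from `x_t` directly.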

There is also an analogue for $n$-step SARSA:

$$ G_t^{(n)} \gets R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n} | \theta_{t+n-1}) $$

$$ \theta_{t+n} \gets \theta_{t+n-1} + \alpha \left[ G_t^{(n)} - \hat{q}(S_t, A_t | \theta_{t+n-1}) \right] \nabla \hat{q}(S_t, A_t | \theta_{t+n-1}) $$
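The $n$-step return can be computed with a short backward accumulation. The sketch below assumes the $n$ rewards have already been collected in a list and that the bootstrap value $\hat{q}(S_{t+n}, A_{t+n} | \theta_{t+n-1})$ has been evaluated separately; the function name `n_step_return` is illustrative.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Compute R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n * bootstrap.

    rewards is [R_{t+1}, ..., R_{t+n}]; bootstrap_value stands in for
    q_hat(S_{t+n}, A_{t+n}). Pass 0.0 if the episode ended within n steps.
    """
    g = bootstrap_value
    # Work backwards so each term picks up the right power of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1 + 0.5*0 + 0.25*2 + 0.125*4 = 2.0
print(n_step_return([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5))  # 2.0
```

The resulting value plays the role of the target $U_t$ in the general update rule.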

Semi-gradient Q-learning has an analogous update:

$$ \theta_{t+1} \gets \theta_{t} + \alpha \left[ R_{t+1} + \gamma \max_{a'} \hat{q}(S_{t+1}, a' | \theta_t) - \hat{q}(S_t, A_t | \theta_t) \right] \nabla \hat{q}(S_t, A_t | \theta_t) $$
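As an illustration of the Q-learning variant, the following sketch assumes a linear approximator with one weight row per discrete action, so that `theta` has shape `(n_actions, n_features)`. This parameterization, the function names, and the hyperparameter values are assumptions chosen for brevity.

```python
import numpy as np

def q_values(theta, state_features):
    """Action values for every action in one state (assumed linear model)."""
    return theta @ state_features

def q_learning_update(theta, s_feat, action, reward, s_next_feat,
                      alpha=0.1, gamma=0.99, terminal=False):
    """One semi-gradient Q-learning step; returns a new weight matrix."""
    # The target maximizes over next actions instead of using the action
    # actually taken next, which is what distinguishes it from SARSA.
    if terminal:
        target = reward
    else:
        target = reward + gamma * np.max(q_values(theta, s_next_feat))
    td_error = target - q_values(theta, s_feat)[action]
    new_theta = theta.copy()
    # Only the row of the taken action has a non-zero gradient.
    new_theta[action] += alpha * td_error * s_feat
    return new_theta

# Toy usage: 2 actions, 2 features, made-up numbers.
theta = np.array([[0.0, 0.0], [1.0, 0.0]])
s = np.array([1.0, 0.0])
theta = q_learning_update(theta, s, action=0, reward=1.0, s_next_feat=s, gamma=0.5)
print(theta)
```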

Any of these methods may be used with any gradient-based optimizer, such as RMSProp, Adam, or vanilla mini-batch stochastic gradient descent.

Application to control problems

To apply these gradient action-value methods to control problems, simply combine them with a suitable exploration policy. For example, a commonly used exploration policy is an $\epsilon$-greedy policy with $\epsilon$ annealed linearly from a high value to a low value.
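A linearly annealed $\epsilon$-greedy policy could look like the sketch below. The start value, end value, and annealing horizon (`eps_start`, `eps_end`, `anneal_steps`) are placeholder assumptions, as are the function names; suitable values depend on the problem.

```python
import numpy as np

def linear_epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=10_000):
    """Linearly interpolate epsilon from eps_start to eps_end, then hold it."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy(action_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))

# Toy usage with made-up action values.
rng = np.random.default_rng(0)
eps = linear_epsilon(step=5_000)
action = epsilon_greedy(np.array([0.2, 0.7, 0.1]), eps, rng)
print(eps, action)
```

Here `action_values` would be the vector of $\hat{q}(S_t, a | \theta_t)$ over all actions $a$ in the current state.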

Remaining issues

We should still be wary of these function approximation methods for the reasons outlined in the chapter “Challenges of Deep RL”: training can be unstable, because the distribution of experiences is non-stationary and successive experiences are temporally correlated.

Fortunately, there are many solutions to the training instability problem for control. An early and influential success at learning from high-dimensional state spaces is the deep Q-network (DQN), which learned policies directly from raw pixel values. It was evaluated on 49 Atari games and reached performance comparable to a professional human games tester on more than half of them.
