It is difficult to use an MDP agent on a bandit task, mainly because of how the eligibility trace is updated.
On a contextual two-armed bandit task, the final action is $\mathbf u' = (0.5, 0.5)^\top$. The 0.5 entries are needed to compute the target $y_t = r_t - \mathbf u'^\top \mathbf Q \mathbf x'$ such that
However, the eligibility trace is updated as
which, in a 4-state (2 context, 2 outcome) task with $\lambda = \gamma = 1$, and where $\mathbf x = (1, 0, 0, 0)^\top$, $\mathbf u = (1, 0)^\top$ and $\mathbf x' = (0, 0, 1, 0)^\top$, should result in a trace that looks like
The current setup allows either the correct trace or the correct target calculation, but not both.
I think the solution may be to separate the trace update from the value function update.
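As a rough illustration of that separation, here is a minimal NumPy sketch. The function names (`update_trace`, `expected_value`, `apply_update`) are hypothetical and are not part of the existing agent. It assumes an accumulating trace of the form $\gamma \lambda \mathbf z + \mathbf u \mathbf x^\top$ and a $\mathbf Q$ of shape (actions, states), and it leaves the sign convention of the target to the caller. The point is only that the trace sees the one-hot action $\mathbf u$ that was actually taken, while the target term can use $\mathbf u' = (0.5, 0.5)^\top$ independently.

```python
import numpy as np

def update_trace(z, x, u, gamma=1.0, lam=1.0):
    """Decay the old trace, then add the outer product u x^T (assumed form)."""
    return gamma * lam * z + np.outer(u, x)

def expected_value(Q, x, u):
    """Return u^T Q x, the term used in the target calculation."""
    return float(u @ Q @ x)

def apply_update(Q, z, delta, alpha=0.1):
    """Move Q along the trace by a prediction error computed elsewhere."""
    return Q + alpha * delta * z

# Values taken from the 4-state, 2-action example in this issue.
x = np.array([1.0, 0.0, 0.0, 0.0])       # current state
u = np.array([1.0, 0.0])                 # action actually taken (one-hot)
x_next = np.array([0.0, 0.0, 1.0, 0.0])  # outcome state
u_next = np.array([0.5, 0.5])            # final "action" used only for the target

Q = np.zeros((2, 4))
z = np.zeros((2, 4))

# The trace is built from (x, u) only, so u_next never leaks into it.
z = update_trace(z, x, u)

# The target term is built from (x_next, u_next) only.
v_next = expected_value(Q, x_next, u_next)
```

With the two steps decoupled like this, the agent could call `update_trace` with the chosen action and still pass the uniform $\mathbf u'$ to the target calculation, instead of one action vector having to serve both purposes.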