MDP Agents on Bandit Tasks #101

abrahamnunes · 2018-07-04T13:22:12Z

It is difficult to use an MDP agent on a Bandit task, mainly because of the eligibility trace update.

On a contextual two-armed bandit task, the final action is $\mathbf u' = (0.5, 0.5)^\top$. The 0.5 entries are needed to compute the target $y_t = r_t - \mathbf u'^\top \mathbf Q \mathbf x'$ such that

However, the eligibility trace is updated as

which, in a 4-state (2 context, 2 outcome) task with $\lambda = \gamma = 1$, and where $\mathbf x = (1, 0, 0, 0)^\top$, $\mathbf u = (1, 0)^\top$ and $\mathbf x' = (0, 0, 1, 0)^\top$, should result in a trace that looks like

The current setup allows either the correct trace or the correct target calculation, but not both.
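To make the conflict concrete, here is a minimal sketch in plain Python. It assumes an accumulating trace of the form $\mathbf z \leftarrow \gamma \lambda \mathbf z + \mathbf u \mathbf x^\top$ with $\mathbf Q$ and $\mathbf z$ stored as (actions × states) arrays. That rule, the helper names (`outer`, `accumulate_trace`, `q_value`) and the values in `Q_demo` are assumptions made for illustration, not code taken from the library.

```python
# Illustrative only: the update rule below is an assumption, not the library's code.
GAMMA = 1.0  # discount factor, as in the example above
LAM = 1.0    # trace decay, as in the example above


def outer(u, x):
    """Outer product u x^T as a nested list (actions x states)."""
    return [[ui * xj for xj in x] for ui in u]


def accumulate_trace(z, x, u, gamma=GAMMA, lam=LAM):
    """Assumed accumulating trace: z <- gamma * lam * z + u x^T."""
    ux = outer(u, x)
    return [
        [gamma * lam * z_ij + ux_ij for z_ij, ux_ij in zip(z_row, ux_row)]
        for z_row, ux_row in zip(z, ux)
    ]


def q_value(Q, x, u):
    """Bilinear form u^T Q x."""
    return sum(u[i] * Q[i][j] * x[j]
               for i in range(len(u)) for j in range(len(x)))


x, u = [1, 0, 0, 0], [1, 0]                 # context state, chosen arm
x_next, u_next = [0, 0, 1, 0], [0.5, 0.5]   # outcome state, averaging "action"

z0 = [[0.0] * 4 for _ in range(2)]
z1 = accumulate_trace(z0, x, u)             # trace after the real choice
z2 = accumulate_trace(z1, x_next, u_next)   # trace if u' is fed in as well

# Hypothetical Q values, chosen only to show what u' does to the target term.
Q_demo = [[0.0, 0.0, 2.0, 0.0],
          [0.0, 0.0, 4.0, 0.0]]
avg_next_value = q_value(Q_demo, x_next, u_next)

print(z1)
print(z2)
print(avg_next_value)
```

Under these assumptions, $\mathbf u'$ averages the two action values in the outcome state, which is what the target needs, but passing the same $\mathbf u'$ to the trace update also writes 0.5 into both action rows of the outcome-state column.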

I think the solution may be to separate the trace updating function from the value function updating.
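A rough sketch of what that separation could look like follows. Everything in it is hypothetical: the class name `SplitUpdateAgent`, its method names, the learning rate `alpha`, and the standard SARSA-style TD error $\delta = r + \gamma \mathbf u'^\top \mathbf Q \mathbf x' - \mathbf u^\top \mathbf Q \mathbf x$ are assumptions for illustration and may differ from the target used in the library.

```python
class SplitUpdateAgent:
    """Hypothetical tabular agent with separate trace and value updates."""

    def __init__(self, n_actions, n_states, alpha=0.1, gamma=1.0, lam=1.0):
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.Q = [[0.0] * n_states for _ in range(n_actions)]  # action values
        self.z = [[0.0] * n_states for _ in range(n_actions)]  # eligibility trace

    def _q(self, x, u):
        """Bilinear form u^T Q x."""
        return sum(u[i] * self.Q[i][j] * x[j]
                   for i in range(len(u)) for j in range(len(x)))

    def update_trace(self, x, u):
        """Decay the trace, then add u x^T for the action actually taken."""
        decay = self.gamma * self.lam
        for i, u_i in enumerate(u):
            for j, x_j in enumerate(x):
                self.z[i][j] = decay * self.z[i][j] + u_i * x_j

    def td_error(self, x, u, r, x_next, u_next):
        """Assumed TD error; u_next may be an averaging vector like (0.5, 0.5)."""
        return r + self.gamma * self._q(x_next, u_next) - self._q(x, u)

    def update_value(self, delta):
        """Move Q along the current trace; the trace itself is not touched."""
        for i, z_row in enumerate(self.z):
            for j, z_ij in enumerate(z_row):
                self.Q[i][j] += self.alpha * delta * z_ij


agent = SplitUpdateAgent(n_actions=2, n_states=4)
x, u = [1, 0, 0, 0], [1, 0]
x_next, u_avg = [0, 0, 1, 0], [0.5, 0.5]

agent.update_trace(x, u)                                   # real action only
delta = agent.td_error(x, u, r=1.0, x_next=x_next, u_next=u_avg)
agent.update_value(delta)
```

Because `u_avg` is only ever passed to `td_error`, it can serve the target calculation without entering the trace, and the caller decides which transitions update the trace at all.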

ARudiuk · 2018-07-04T14:56:43Z

Some of the math seems to not be rendering @abrahamnunes

hardik44fg · 2020-09-30T18:37:05Z

@abrahamnunes Try to highlight the important words so it will help someone to easily understand
