Cannot reproduce DQN Breakout baseline #49
Comments
Hello, PS: please also fill in the issue template completely (notably the package versions and OS).
For reference, here is the current result of the trained agent using 5000 test steps:
As you can see, the mean reward is around 10, but the corresponding score is much higher. EDIT: I reactivated the episode reward print to generate this.
More explanation: openai/baselines#667 EDIT: If you look at the learning curve here, the current trained DQN agent matches the previous performance (mean score around 200).
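For anyone wanting to reproduce that quick check, here is a minimal sketch of such an evaluation (this is not the zoo's actual enjoy.py; the model path and step count are assumptions):

```python
# Sketch: load a trained DQN agent and run it for a fixed number of test steps,
# tracking the per-episode reward as seen through the training wrappers.
from stable_baselines import DQN
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'), frame_stack=True)
model = DQN.load('trained_agents/dqn/BreakoutNoFrameskip-v4.pkl')  # hypothetical path

obs, episode_reward, episode_rewards = env.reset(), 0.0, []
for _ in range(5000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _ = env.step(action)
    episode_reward += reward
    if done:
        episode_rewards.append(episode_reward)
        episode_reward = 0.0
        obs = env.reset()

# Note: with the DeepMind wrappers this is the clipped, one-life-per-episode reward,
# not the raw game score.
print('mean episode reward:', sum(episode_rewards) / max(len(episode_rewards), 1))
```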
I'm training from scratch using the current code available in
So the score mostly hangs around 80, occasionally hitting 239. However, the more concerning thing is that there is no apparent convergence, as shown by the Tensorboard graph. As a sanity check, I also trained a model for Pong, which shows good convergence: I suspect something is broken and train.py no longer generates a model that reproduces high scores for Breakout. I've also tested OpenAI baselines, and the monitor.csv, which is supposed to contain the raw score, is similarly stuck in the low 30s.
Digging more, the average of the episode scores from enjoy.py is 113.8, which is still not out of line with 131.4 from OpenAI and 123 from RLlib. So the question that remains: is the above training curve expected? It's nothing like Pong, as I posted above. Even with smoothing, there is not much of a convergence pattern in that curve.
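For reference, a small sketch of how such an average can be computed from a Monitor log (the monitor.csv format starts with a JSON comment line, hence the skipped row; the path below is an assumption):

```python
# Sketch: average the per-episode returns recorded in a monitor.csv produced by
# the Monitor wrapper (OpenAI baselines / stable-baselines format).
import pandas as pd

df = pd.read_csv('logs/dqn/BreakoutNoFrameskip-v4_1/monitor.csv', skiprows=1)  # hypothetical path
print('episodes:', len(df))
print('mean episode reward:', df['r'].mean())
print('last-100-episode mean:', df['r'].tail(100).mean())
```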
As mentioned before, the episode reward does not represent the score of the agent (you can take a look at the wrappers here). To monitor it in Tensorboard, you will just have to follow the doc, as you mentioned. Again, monitoring the training reward is only a proxy for the true performance (more on that below ;) )
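To make the wrapper point concrete, here is a sketch (assuming the DeepMind-style Atari wrappers shipped with stable-baselines) of the difference between the training env, which clips rewards and ends an episode on every life loss, and an evaluation env that reports the raw game score:

```python
# Sketch: why the training episode reward is not the game score. The exact
# wrapper setup used by the zoo may differ slightly.
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

train_env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'),
                          episode_life=True, clip_rewards=True)   # reward clipped to {-1, 0, 1}, episode = one life
eval_env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'),
                         episode_life=False, clip_rewards=False)  # raw game score, full game
```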
First, I recommend reading "How many random seeds should I use?" by @ccolas and "RL that matters". I assume you trained using only one random seed? Then you have to know that you are training a Double Dueling Deep Q-Network with Prioritized Experience Replay (and not a vanilla DQN).
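As an illustration (not the zoo's exact setup: the hyperparameters, timesteps and evaluation protocol below are placeholders), running several seeds and reporting mean and standard deviation could look like this:

```python
# Sketch: train one agent per seed, then evaluate each on an env that reports
# the raw game score, and aggregate over seeds.
import numpy as np
from stable_baselines import DQN
from stable_baselines.common import set_global_seeds
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

def run_seed(seed, train_steps=100000, eval_episodes=10):
    """Train one agent with a fixed seed and return its mean evaluation score."""
    set_global_seeds(seed)
    train_env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'), frame_stack=True)
    model = DQN('CnnPolicy', train_env, verbose=0)
    model.learn(total_timesteps=train_steps)

    # Evaluate without reward clipping and without episode-per-life, so the
    # number is comparable to published Breakout scores.
    eval_env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'),
                             episode_life=False, clip_rewards=False, frame_stack=True)
    returns = []
    for _ in range(eval_episodes):
        obs, done, total = eval_env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = eval_env.step(action)
            total += reward
        returns.append(total)
    return np.mean(returns)

scores = [run_seed(s) for s in range(3)]
print('score over 3 seeds: %.1f +/- %.1f' % (np.mean(scores), np.std(scores)))
```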
The results were reproduced recently by Anssi using both the SB2 and SB3 code: DLR-RM/stable-baselines3#110 (review). He disabled all extensions for that (no PER, no double/dueling Q-learning).
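For completeness, disabling those extensions in stable-baselines (v2) could look roughly like this; the keyword arguments follow the SB2 DQN API as I recall it, and the other hyperparameters are only illustrative, not the zoo's tuned values:

```python
# Sketch: turning Double/Dueling DQN + PER back into a "vanilla" DQN.
from stable_baselines import DQN
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'), frame_stack=True)
model = DQN(
    'CnnPolicy', env,
    double_q=False,                     # no Double Q-learning
    prioritized_replay=False,           # no Prioritized Experience Replay
    policy_kwargs=dict(dueling=False),  # no dueling architecture
    buffer_size=10000,
    learning_starts=10000,
    exploration_fraction=0.1,
    verbose=1,
)
model.learn(total_timesteps=int(1e7))
```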
I'm excited by the "stable" promise of stable-baselines, but currently I'm not able to reproduce the DQN results for Breakout. It is well known that you should get a 300+ score in Breakout with DQN, and this can be confirmed by the monitor.csv in benchmark.zip in this repo. Coincidentally, OpenAI Baselines is also broken for DQN/Breakout. I suspect their bug has also impacted stable-baselines.
Here are my results:
Tensorboard curve:
Last 3 stdout logs:
As we can see, training does not converge and the reward stays stuck at 8.5, sometimes randomly peaking at up to 22, still well below the expected 300+.