Cannot reproduce DQN Breakout baseline #49
Comments
Hello, PS: please also fill in the issue template completely (notably the package versions and OS).
For reference, here is the current result of the trained agent using 5000 test steps:
As you can see, the mean reward is around 10, but the corresponding score is much higher. EDIT: I reactivated the episode reward print to generate this.
More explanation: openai/baselines#667 EDIT: If you look at the learning curve here, the current trained DQN agent matches the previous performance (mean score around 200).
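For anyone wanting to reproduce that quick check, here is a minimal sketch of such an evaluation (this is not the zoo's actual enjoy.py; the model path and step count are assumptions):

```python
# Sketch: load a trained DQN agent and run it for a fixed number of test steps,
# tracking the per-episode reward as seen through the training wrappers.
from stable_baselines import DQN
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'), frame_stack=True)
model = DQN.load('trained_agents/dqn/BreakoutNoFrameskip-v4.pkl')  # hypothetical path

obs, episode_reward, episode_rewards = env.reset(), 0.0, []
for _ in range(5000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _ = env.step(action)
    episode_reward += reward
    if done:
        episode_rewards.append(episode_reward)
        episode_reward = 0.0
        obs = env.reset()

# Note: with the DeepMind wrappers this is the clipped, one-life-per-episode reward,
# not the raw game score.
print('mean episode reward:', sum(episode_rewards) / max(len(episode_rewards), 1))
```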
I'm training from scratch using the current code available in
So the score mostly hangs around 80, occasionally hitting 239. However, the more concerning thing is that there is no apparent convergence, as shown by the Tensorboard graph. As a sanity check, I also trained a model for Pong, which shows good convergence: I suspect something is broken and train.py no longer generates a model that reproduces high scores for Breakout. I've also tested OpenAI baselines, and the monitor.csv, which is supposed to contain the raw score, is similarly stuck in the low 30s.
Digging more, the average of the episode scores from enjoy.py is 113.8, which is still not out of line with 131.4 from OpenAI and 123 from RLlib. So the question that remains: is the above training curve expected? It's nothing like Pong, as I posted above. Even with smoothing, there is not much of a convergence pattern in that curve.
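For reference, a small sketch of how such an average can be computed from a Monitor log (the monitor.csv format starts with a JSON comment line, hence the skipped row; the path below is an assumption):

```python
# Sketch: average the per-episode returns recorded in a monitor.csv produced by
# the Monitor wrapper (OpenAI baselines / stable-baselines format).
import pandas as pd

df = pd.read_csv('logs/dqn/BreakoutNoFrameskip-v4_1/monitor.csv', skiprows=1)  # hypothetical path
print('episodes:', len(df))
print('mean episode reward:', df['r'].mean())
print('last-100-episode mean:', df['r'].tail(100).mean())
```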
As mentioned before, the episode reward does not represent the score of the agent (you can take a look at the wrappers here). To monitor it in Tensorboard, you will just have to follow the doc, as you mentioned. Again, monitoring the training reward is only a proxy for the true performance (more on that below ;) )
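To make the wrapper point concrete, here is a sketch (assuming the DeepMind-style Atari wrappers shipped with stable-baselines) of the difference between the training env, which clips rewards and ends an episode on every life loss, and an evaluation env that reports the raw game score:

```python
# Sketch: why the training episode reward is not the game score. The exact
# wrapper setup used by the zoo may differ slightly.
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

train_env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'),
                          episode_life=True, clip_rewards=True)   # reward clipped to {-1, 0, 1}, episode = one life
eval_env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'),
                         episode_life=False, clip_rewards=False)  # raw game score, full game
```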
First, I recommend reading "How many random seeds should I use?" by @ccolas and "RL that matters". I assume you trained using only one random seed? Then you have to know that you are training a Double Dueling Deep Q-Network with Prioritized Experience Replay (and not a vanilla DQN).
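As an illustration (not the zoo's exact setup: the hyperparameters, timesteps and evaluation protocol below are placeholders), running several seeds and reporting mean and standard deviation could look like this:

```python
# Sketch: train one agent per seed, then evaluate each on an env that reports
# the raw game score, and aggregate over seeds.
import numpy as np
from stable_baselines import DQN
from stable_baselines.common import set_global_seeds
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

def run_seed(seed, train_steps=100000, eval_episodes=10):
    """Train one agent with a fixed seed and return its mean evaluation score."""
    set_global_seeds(seed)
    train_env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'), frame_stack=True)
    model = DQN('CnnPolicy', train_env, verbose=0)
    model.learn(total_timesteps=train_steps)

    # Evaluate without reward clipping and without episode-per-life, so the
    # number is comparable to published Breakout scores.
    eval_env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'),
                             episode_life=False, clip_rewards=False, frame_stack=True)
    returns = []
    for _ in range(eval_episodes):
        obs, done, total = eval_env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = eval_env.step(action)
            total += reward
        returns.append(total)
    return np.mean(returns)

scores = [run_seed(s) for s in range(3)]
print('score over 3 seeds: %.1f +/- %.1f' % (np.mean(scores), np.std(scores)))
```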
The results were reproduced recently by Anssi using both the SB2 and SB3 code: DLR-RM/stable-baselines3#110 (review). He disabled all extensions for that (no PER, no double/dueling Q-learning).
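For completeness, disabling those extensions in stable-baselines (v2) could look roughly like this; the keyword arguments follow the SB2 DQN API as I recall it, and the other hyperparameters are only illustrative, not the zoo's tuned values:

```python
# Sketch: turning Double/Dueling DQN + PER back into a "vanilla" DQN.
from stable_baselines import DQN
from stable_baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'), frame_stack=True)
model = DQN(
    'CnnPolicy', env,
    double_q=False,                     # no Double Q-learning
    prioritized_replay=False,           # no Prioritized Experience Replay
    policy_kwargs=dict(dueling=False),  # no dueling architecture
    buffer_size=10000,
    learning_starts=10000,
    exploration_fraction=0.1,
    verbose=1,
)
model.learn(total_timesteps=int(1e7))
```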
I'm excited by the "stable" promise of stable-baselines, but currently I'm not able to reproduce the DQN results for Breakout. It is well known that you should get a 300+ score in Breakout with DQN, and this can be confirmed by the monitor.csv in benchmark.zip in this repo. Coincidentally, OpenAI Baselines is also broken for DQN/Breakout. I suspect their bug has also impacted stable-baselines.
Here are my results:
Tensorboard curve:
Last 3 stdout logs:
As we can see, training does not converge and the reward stays stuck at 8.5, sometimes randomly peaking at up to 22, still well below the expected 300+.