A3C Example for reproducing paper results. #433
I'm assuming this follows the specification of the A3C paper:
"The agents used the network architecture from (Mnih et al., 2013). The network used a convolutional layer with 16 filters of size 8 × 8 with stride 4, followed by a convolutional layer with 32 filters of size 4 × 4 with stride 2, followed by a fully connected layer with 256 hidden units. All three hidden layers were followed by a rectifier nonlinearity. The value-based methods had a single linear output unit for each action representing the action-value. The model used by actor-critic agents had two sets of outputs – a softmax output with one entry per action representing the probability of selecting the action, and a single linear output representing the value function." (Source: A3C Paper)
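For concreteness, a minimal PyTorch sketch of that architecture (the framework choice, the 4-frame 84×84 input assumption, and all names here are illustrative, not taken from this PR):

```python
import torch
import torch.nn as nn

class A3CNet(nn.Module):
    """Sketch of the Mnih et al. (2013) network with actor-critic heads.

    Assumes 4 stacked 84x84 grayscale frames as input; the exact
    preprocessing used by this PR may differ.
    """

    def __init__(self, num_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),  # 16 filters, 8x8, stride 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),           # 32 filters, 4x4, stride 2
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                           # fully connected, 256 units
            nn.ReLU(),
        )
        # Actor-critic heads: a softmax policy over actions and a scalar value.
        self.policy = nn.Linear(256, num_actions)
        self.value = nn.Linear(256, 1)

    def forward(self, x):
        h = self.features(x)
        return torch.softmax(self.policy(h), dim=-1), self.value(h)
```

For a value-based (DQN-style) variant, the two heads would be replaced by a single linear layer with one output per action, as the quoted passage describes.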
Source: Noisy Nets paper - "In each case, we used the neural network architecture from the corresponding original papers for both the baseline and NoisyNet variant"
Source: A3C paper, Appendix 8: "All methods performed updates after every 5 actions (t_max = 5 and I_Update = 5) and shared RMSProp was used for optimization"
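A hypothetical configuration fragment reflecting that setup (the key names are placeholders, not necessarily those used by this example):

```python
# Update cadence and optimizer choice per Appendix 8 of the A3C paper;
# key names are illustrative, not read from this PR.
a3c_config = {
    "t_max": 5,              # perform an update after every 5 actions
    "optimizer": "RMSProp",  # RMSProp statistics shared across worker threads
}
```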
Source: Noisy Nets paper - "The DQN and A3C agents were training for 200M and 320M frames, respectively".
Source: Noisy Networks paper - "Episodes are truncated at 108K frames (or 30 minutes of simulated play) (van Hasselt et al., 2016)." However, it's unclear from the context whether this refers to training or testing. Given the nature of other Deep RL papers, I'm assuming the truncation applies to both training and evaluation.
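One way to enforce that cap on both training and evaluation environments is a standard Gym time-limit wrapper; a sketch, assuming a NoFrameskip Atari environment where one step equals one emulator frame (the environment id is only an example):

```python
import gym
from gym.wrappers import TimeLimit

# Truncate episodes at 108K emulator frames (~30 minutes at 60 fps).
# "PongNoFrameskip-v4" is a placeholder id; if a frame-skip wrapper is
# applied on top, the cap would be 108_000 // skip agent steps instead.
env = gym.make("PongNoFrameskip-v4")
env = TimeLimit(env, max_episode_steps=108_000)
```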
Source: A3C Paper - "and an RMSProp decay factor of α = 0.99"
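In PyTorch, for instance, that decay factor corresponds to the `alpha` argument of `torch.optim.RMSprop`; the learning rate and epsilon below are placeholders, not values quoted from the paper or this PR:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 6)  # stand-in for the actor-critic network above
optimizer = torch.optim.RMSprop(model.parameters(), lr=7e-4, alpha=0.99, eps=1e-5)
```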
Source: A3C Paper - "All experiments used a discount of γ = 0.99"
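A minimal sketch of how that discount enters the n-step return targets used by A3C (function and argument names are illustrative):

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1}, bootstrapping from the value
    estimate of the state following the last collected reward."""
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example: discounted_returns([0.0, 0.0, 1.0], bootstrap_value=0.5)
# -> [1.4652495, 1.48005, 1.495]
```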