A3C Example for reproducing paper results. #433
Conversation
Scores: Comparison against original reported results...
from chainerrl.wrappers import atari_wrappers


class A3CFF(chainer.ChainList, a3c.A3CModel):
I'm assuming this follows the specification of the A3C paper:
"The agents used the network architecture from (Mnih et al., 2013). The network used a convolutional layer with 16 filters of size 8 × 8 with stride 4, followed by a convolutional layer with 32 filters of size 4 × 4 with stride 2, followed by a fully connected layer with 256 hidden units. All three hidden layers were followed by a rectifier nonlinearity. The value-based methods had a single linear output unit for each action representing the action-value. The model used by actor-critic agents had two sets of outputs – a softmax output with one entry per action representing the probability of selecting the action, and a single linear output representing the value function." (Source: A3C paper)
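For reference, a minimal Chainer sketch of that architecture as a ChainerRL A3C model might look like the following; the SoftmaxPolicy helper and the exact layer wiring are my assumptions for illustration, not necessarily the code in this PR.

import chainer
import chainer.functions as F
import chainer.links as L
from chainerrl import policies
from chainerrl.agents import a3c


class A3CFF(chainer.ChainList, a3c.A3CModel):
    """Feed-forward actor-critic model (architecture of Mnih et al., 2013)."""

    def __init__(self, n_actions):
        # 16 filters of 8x8 with stride 4, then 32 filters of 4x4 with
        # stride 2, then a 256-unit fully connected layer, each with a ReLU.
        self.conv1 = L.Convolution2D(None, 16, ksize=8, stride=4)
        self.conv2 = L.Convolution2D(None, 32, ksize=4, stride=2)
        self.fc = L.Linear(None, 256)
        # Actor-critic heads: a softmax over actions and a scalar state value.
        self.pi = policies.SoftmaxPolicy(model=L.Linear(256, n_actions))
        self.v = L.Linear(256, 1)
        super().__init__(self.conv1, self.conv2, self.fc, self.pi, self.v)

    def pi_and_v(self, state):
        h = F.relu(self.conv1(state))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc(h))
        return self.pi(h), self.v(h)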
Source: Noisy Nets paper - "In each case, we used the neural network architecture from the
corresponding original papers for both the baseline and NoisyNet variant"
parser.add_argument('--outdir', type=str, default='results',
                    help='Directory path to save output files.'
                         ' If it does not exist, it will be created.')
parser.add_argument('--t-max', type=int, default=5)
Source: A3C paper, Appendix 8: "All methods performed updates after every 5 actions (t_max = 5 and I_Update = 5) and shared RMSProp was used for optimization"
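Schematically (this is not ChainerRL's actual implementation), t_max = 5 means each actor-learner collects at most five transitions, computes n-step returns from them, and then applies one gradient update. A small self-contained sketch of that return calculation:

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    # Backward-accumulated returns for one rollout of length <= t_max;
    # bootstrap_value stands for V(s_{t_max}) estimated by the critic.
    returns = []
    R = bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))


# With t_max = 5, updates are computed from rollouts of at most 5 rewards:
print(n_step_returns([1.0, 0.0, 0.0, 1.0, 0.0], bootstrap_value=0.5))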
    # Feature extractor
    return np.asarray(x, dtype=np.float32) / 255

agent = a3c.A3C(model, opt, t_max=args.t_max, gamma=0.99,
Source: A3C Paper - "All experiments used a discount of γ = 0.99"
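The hunk above shows the phi feature extractor and the gamma=0.99 passed to the agent. Here is a small runnable sketch of that scaling, with the agent call shown only as a comment; the beta and phi keyword names are my reading of ChainerRL's A3C signature rather than quotes from the diff:

import numpy as np


def phi(x):
    # Feature extractor: scale uint8 Atari frames (0-255) to float32 in [0, 1].
    return np.asarray(x, dtype=np.float32) / 255


frames = np.random.randint(0, 256, size=(4, 84, 84), dtype=np.uint8)
print(phi(frames).dtype, phi(frames).max() <= 1.0)  # float32 True
# The extractor and the paper's discount are then passed to the agent, e.g.:
#   agent = a3c.A3C(model, opt, t_max=args.t_max, gamma=0.99,
#                   beta=args.beta, phi=phi)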
    [model(fake_obs)],
    os.path.join(args.outdir, 'model'))

opt = rmsprop_async.RMSpropAsync(lr=7e-4, eps=1e-1, alpha=0.99)
Source: A3C Paper - "and an RMSProp decay factor of α = 0.99"
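A self-contained sketch of that optimizer setup, assuming ChainerRL's RMSpropAsync; the gradient-clipping hook and its threshold of 40 are assumptions about the example script, not part of this hunk:

import chainer
import chainer.links as L
from chainerrl.optimizers import rmsprop_async

# RMSProp with the A3C paper's hyperparameters: lr 7e-4, decay alpha = 0.99,
# eps 1e-1. RMSpropAsync is ChainerRL's variant intended for asynchronous
# training, where the accumulated statistics end up shared across workers.
model = L.Linear(256, 4)  # stand-in link; the example uses the A3CFF model
opt = rmsprop_async.RMSpropAsync(lr=7e-4, eps=1e-1, alpha=0.99)
opt.setup(model)
# The ChainerRL Atari examples typically also clip gradients (threshold assumed here).
opt.add_hook(chainer.optimizer.GradientClipping(40))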
parser.add_argument('--t-max', type=int, default=5)
parser.add_argument('--beta', type=float, default=1e-2)
parser.add_argument('--profile', action='store_true')
parser.add_argument('--steps', type=int, default=8 * 10 ** 7)
Source: Noisy nets paper - "The DQN and A3C agents were trained for 200M and 320M frames, respectively".
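That 320M-frame budget maps onto the --steps default of 8 * 10 ** 7 if one assumes the standard Atari frame skip of 4, so each agent step consumes four emulator frames:

# 320M emulator frames / frame skip of 4 = 80M agent steps.
# The frame skip of 4 is the usual Atari setting and is an assumption here.
total_frames = 320 * 10 ** 6
frame_skip = 4
print(total_frames // frame_skip == 8 * 10 ** 7)  # True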
from chainerrl.wrappers import atari_wrappers


class A3CFF(chainer.ChainList, a3c.A3CModel):
Source: Noisy Nets paper - "In each case, we used the neural network architecture from the
corresponding original papers for both the baseline and NoisyNet variant"
parser.add_argument('--beta', type=float, default=1e-2)
parser.add_argument('--profile', action='store_true')
parser.add_argument('--steps', type=int, default=8 * 10 ** 7)
parser.add_argument('--max-frames', type=int,
Source: Noisy Networks paper - "Episodes are truncated at 108K frames (or
30 minutes of simulated play) (van Hasselt et al., 2016)." However, it's unclear in the context whether this refers to training or testing. Given the nature of other Deep RL papers, I'm assuming the truncation applies to both training and evaluation.
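For context, 30 minutes of simulated play at 60 frames per second is 30 * 60 * 60 = 108,000 emulator frames, which matches the 108K figure. A sketch of how the cap might be threaded into the Atari wrapper; the max_frames keyword of make_atari is my assumption about chainerrl.wrappers.atari_wrappers, not quoted from the diff:

from chainerrl.wrappers import atari_wrappers

# 30 minutes of play at 60 fps = 108,000 emulator frames.
max_frames = 30 * 60 * 60
assert max_frames == 108000

# Sketch: apply the per-episode frame cap when building the environment.
env = atari_wrappers.make_atari('BreakoutNoFrameskip-v4', max_frames=max_frames)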
Other comments/questions:
A3C is on-policy, so I don't think they used any specific exploration strategy besides sampling from the current policy.
Can you state this in the README as well?
Quote from the paper: "The value based methods sampled the exploration rate ε from a distribution taking three values ε1, ε2, ε3 with probabilities 0.4, 0.3, 0.3. The values of ε1, ε2, ε3 were annealed from 1 to 0.1, 0.01, 0.5 respectively over the first four million frames. Advantage actor-critic used entropy regularization with a weight β = 0.01 for all Atari and TORCS experiments". It seems you're right. The value-based baselines used epsilon-greedy exploration.
Confirmed CI failure is caused by #463 |