A3C Example for reproducing paper results. #433
Conversation
Scores: Comparison against original reported results...
from chainerrl.wrappers import atari_wrappers


class A3CFF(chainer.ChainList, a3c.A3CModel):
I'm assuming this follows the specification of the A3C paper:
"The agents used the network architecture from (Mnih et al., 2013). The network used a convolutional layer with 16 filters of size 8 × 8 with stride 4, followed by a convolutional layer with 32 filters of size 4 × 4 with stride 2, followed by a fully connected layer with 256 hidden units. All three hidden layers were followed by a rectifier nonlinearity. The value-based methods had a single linear output unit for each action representing the action-value. The model used by actor-critic agents had two sets of outputs – a softmax output with one entry per action representing the probability of selecting the action, and a single linear output representing the value function." (Source: A3C paper)
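For reference, a minimal Chainer sketch of that architecture as a ChainerRL A3C model might look like the following; the SoftmaxPolicy helper and the exact layer wiring are my assumptions for illustration, not necessarily the code in this PR.

import chainer
import chainer.functions as F
import chainer.links as L
from chainerrl import policies
from chainerrl.agents import a3c


class A3CFF(chainer.ChainList, a3c.A3CModel):
    """Feed-forward actor-critic model (architecture of Mnih et al., 2013)."""

    def __init__(self, n_actions):
        # 16 filters of 8x8 with stride 4, then 32 filters of 4x4 with
        # stride 2, then a 256-unit fully connected layer, each with a ReLU.
        self.conv1 = L.Convolution2D(None, 16, ksize=8, stride=4)
        self.conv2 = L.Convolution2D(None, 32, ksize=4, stride=2)
        self.fc = L.Linear(None, 256)
        # Actor-critic heads: a softmax over actions and a scalar state value.
        self.pi = policies.SoftmaxPolicy(model=L.Linear(256, n_actions))
        self.v = L.Linear(256, 1)
        super().__init__(self.conv1, self.conv2, self.fc, self.pi, self.v)

    def pi_and_v(self, state):
        h = F.relu(self.conv1(state))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc(h))
        return self.pi(h), self.v(h)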
Source: Noisy Nets paper - "In each case, we used the neural network architecture from the
corresponding original papers for both the baseline and NoisyNet variant"
parser.add_argument('--outdir', type=str, default='results',
                    help='Directory path to save output files.'
                         ' If it does not exist, it will be created.')
parser.add_argument('--t-max', type=int, default=5)
Source: A3C paper, Appendix 8: "All methods performed updates after every 5 actions (t_max = 5 and I_Update = 5) and shared RMSProp was used for optimization"
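Schematically (this is not ChainerRL's actual implementation), t_max = 5 means each actor-learner collects at most five transitions, computes n-step returns from them, and then applies one gradient update. A small self-contained sketch of that return calculation:

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    # Backward-accumulated returns for one rollout of length <= t_max;
    # bootstrap_value stands for V(s_{t_max}) estimated by the critic.
    returns = []
    R = bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))


# With t_max = 5, updates are computed from rollouts of at most 5 rewards:
print(n_step_returns([1.0, 0.0, 0.0, 1.0, 0.0], bootstrap_value=0.5))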
    # Feature extractor
    return np.asarray(x, dtype=np.float32) / 255

agent = a3c.A3C(model, opt, t_max=args.t_max, gamma=0.99,
Source: A3C Paper - "All experiments used a discount of γ = 0.99"
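The hunk above shows the phi feature extractor and the gamma=0.99 passed to the agent. Here is a small runnable sketch of that scaling, with the agent call shown only as a comment; the beta and phi keyword names are my reading of ChainerRL's A3C signature rather than quotes from the diff:

import numpy as np


def phi(x):
    # Feature extractor: scale uint8 Atari frames (0-255) to float32 in [0, 1].
    return np.asarray(x, dtype=np.float32) / 255


frames = np.random.randint(0, 256, size=(4, 84, 84), dtype=np.uint8)
print(phi(frames).dtype, phi(frames).max() <= 1.0)  # float32 True
# The extractor and the paper's discount are then passed to the agent, e.g.:
#   agent = a3c.A3C(model, opt, t_max=args.t_max, gamma=0.99,
#                   beta=args.beta, phi=phi)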
    [model(fake_obs)],
    os.path.join(args.outdir, 'model'))

opt = rmsprop_async.RMSpropAsync(lr=7e-4, eps=1e-1, alpha=0.99)
Source: A3C Paper - "and an RMSProp decay factor of α = 0.99"
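A self-contained sketch of that optimizer setup, assuming ChainerRL's RMSpropAsync; the gradient-clipping hook and its threshold of 40 are assumptions about the example script, not part of this hunk:

import chainer
import chainer.links as L
from chainerrl.optimizers import rmsprop_async

# RMSProp with the A3C paper's hyperparameters: lr 7e-4, decay alpha = 0.99,
# eps 1e-1. RMSpropAsync is ChainerRL's variant intended for asynchronous
# training, where the accumulated statistics end up shared across workers.
model = L.Linear(256, 4)  # stand-in link; the example uses the A3CFF model
opt = rmsprop_async.RMSpropAsync(lr=7e-4, eps=1e-1, alpha=0.99)
opt.setup(model)
# The ChainerRL Atari examples typically also clip gradients (threshold assumed here).
opt.add_hook(chainer.optimizer.GradientClipping(40))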
parser.add_argument('--t-max', type=int, default=5)
parser.add_argument('--beta', type=float, default=1e-2)
parser.add_argument('--profile', action='store_true')
parser.add_argument('--steps', type=int, default=8 * 10 ** 7)
Source: Noisy nets paper - "The DQN and A3C agents were trained for 200M and 320M frames, respectively".
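That 320M-frame budget maps onto the --steps default of 8 * 10 ** 7 if one assumes the standard Atari frame skip of 4, so each agent step consumes four emulator frames:

# 320M emulator frames / frame skip of 4 = 80M agent steps.
# The frame skip of 4 is the usual Atari setting and is an assumption here.
total_frames = 320 * 10 ** 6
frame_skip = 4
print(total_frames // frame_skip == 8 * 10 ** 7)  # True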
from chainerrl.wrappers import atari_wrappers


class A3CFF(chainer.ChainList, a3c.A3CModel):
Source: Noisy Nets paper - "In each case, we used the neural network architecture from the
corresponding original papers for both the baseline and NoisyNet variant"
parser.add_argument('--beta', type=float, default=1e-2)
parser.add_argument('--profile', action='store_true')
parser.add_argument('--steps', type=int, default=8 * 10 ** 7)
parser.add_argument('--max-frames', type=int,
Source: Noisy Networks paper - "Episodes are truncated at 108K frames (or
30 minutes of simulated play) (van Hasselt et al., 2016)." However, it's unclear in the context whether this refers to training or testing. Given the nature of other Deep RL papers, I'm assuming the truncation applies to both training and evaluation.
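For context, 30 minutes of simulated play at 60 frames per second is 30 * 60 * 60 = 108,000 emulator frames, which matches the 108K figure. A sketch of how the cap might be threaded into the Atari wrapper; the max_frames keyword of make_atari is my assumption about chainerrl.wrappers.atari_wrappers, not quoted from the diff:

from chainerrl.wrappers import atari_wrappers

# 30 minutes of play at 60 fps = 108,000 emulator frames.
max_frames = 30 * 60 * 60
assert max_frames == 108000

# Sketch: apply the per-episode frame cap when building the environment.
env = atari_wrappers.make_atari('BreakoutNoFrameskip-v4', max_frames=max_frames)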
Other comments/questions:
A3C is on-policy, so I don't think they used any specific exploration strategy besides sampling from the current policy.
Can you state this in the README as well?
Quote from the paper: "The value based methods sampled the exploration rate ε from a distribution taking three values ε1, ε2, ε3 with probabilities 0.4, 0.3, 0.3. The values of ε1, ε2, ε3 were annealed from 1 to 0.1, 0.01, 0.5 respectively over the first four million frames. Advantage actor-critic used entropy regularization with a weight β = 0.01 for all Atari and TORCS experiments". It seems you're right. The value-based baselines used epsilon-greedy exploration.
Confirmed CI failure is caused by #463 |