
A3C Example for reproducing paper results. #433

Merged: 12 commits from the a3c_example branch into chainer:master on Jun 5, 2019

Conversation

prabhatnagarajan (Contributor)

No description provided.

@prabhatnagarajan (Contributor Author) commented May 20, 2019:

Scores:

Comparison against original reported results...

| Game | ChainerRL Score | Original Reported Scores |
| --- | --- | --- |
| AirRaid | 4625.9 | N/A |
| Alien | 1397.2 | 2027 |
| Amidar | 1110.8 | 904 |
| Assault | 5821.6 | 2879 |
| Asterix | 6820.7 | 6822 |
| Asteroids | 2428.8 | 2544 |
| Atlantis | 732425.0 | 422700 |
| BankHeist | 1308.9 | 1296 |
| BattleZone | 5421.1 | 16411 |
| BeamRider | 8493.4 | 9214 |
| Berzerk | 1594.2 | 1022 |
| Bowling | 31.7 | 37 |
| Boxing | 98.1 | 91 |
| Breakout | 533.6 | 496 |
| Carnival | 5132.9 | N/A |
| Centipede | 4849.9 | 5350 |
| ChopperCommand | 4881.0 | 5285 |
| CrazyClimber | 124400.0 | 134783 |
| Defender | N/A | 52917.0 |
| DemonAttack | 108832.5 | 37085 |
| DoubleDunk | 1.5 | 3 |
| Enduro | 0.0 | 0 |
| FishingDerby | 36.3 | -7 |
| Freeway | 0.0 | 0 |
| Frostbite | 313.6 | 288 |
| Gopher | 8746.5 | 7992 |
| Gravitar | 228.0 | 379 |
| Hero | 36892.5 | 30791 |
| IceHockey | -4.6 | -2 |
| JamesBond | N/A | 509.0 |
| Jamesbond | 370.1 | N/A |
| JourneyEscape | -871.2 | N/A |
| Kangaroo | 115.8 | 1166 |
| Krull | 10601.4 | 9422 |
| KungFuMaster | 40970.4 | 37422 |
| MontezumaRevenge | 1.9 | 14 |
| MsPacman | 2498.0 | 2436 |
| NameThisGame | 6597.0 | 7168 |
| Phoenix | 42654.5 | 9476 |
| Pitfall | -10.8 | N/A |
| Pitfall! | N/A | 0.0 |
| Pong | 20.9 | 7 |
| Pooyan | 4067.9 | N/A |
| PrivateEye | 376.1 | 3781 |
| Qbert | 15610.6 | 18586 |
| Riverraid | 13223.3 | N/A |
| RoadRunner | 39897.8 | 45315 |
| Robotank | 2.9 | 6 |
| Seaquest | 1786.5 | 1744 |
| Skiing | -16090.5 | -12972 |
| Solaris | 3157.8 | 12380 |
| SpaceInvaders | 1630.6 | 1034 |
| StarGunner | 57943.2 | 49156 |
| Surround | N/A | -8.0 |
| Tennis | -0.3 | -6 |
| TimePilot | 3850.6 | 10294 |
| Tutankham | 331.4 | 213 |
| UpNDown | 17952.0 | 89067 |
| Venture | 0.0 | 0 |
| VideoPinball | 407331.2 | 229402 |
| WizardOfWor | 2800.0 | 8953 |
| YarsRevenge | 25175.5 | 21596 |
| Zaxxon | 80.7 | 16544 |
Results Summary

| | |
| --- | --- |
| Number of seeds | 1 |
| Number of common domains | 52 |
| Number of domains where paper scores higher | 25 |
| Number of domains where ChainerRL scores higher | 24 |
| Number of ties between paper and ChainerRL | 3 |
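
As a side note, a small helper like the following (my own sketch, not a script from this PR) can compute such a summary from two score tables:

```python
def summarize(chainerrl_scores, paper_scores):
    """Count head-to-head results over the domains present in both tables.
    Purely illustrative; not the code used to produce the numbers above."""
    common = sorted(set(chainerrl_scores) & set(paper_scores))
    return {
        'common domains': len(common),
        'paper higher': sum(paper_scores[g] > chainerrl_scores[g] for g in common),
        'chainerrl higher': sum(chainerrl_scores[g] > paper_scores[g] for g in common),
        'ties': sum(paper_scores[g] == chainerrl_scores[g] for g in common),
    }


# e.g. summarize({'Breakout': 533.6, 'Enduro': 0.0}, {'Breakout': 496.0, 'Enduro': 0.0})
# -> {'common domains': 2, 'paper higher': 0, 'chainerrl higher': 1, 'ties': 1}
```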

from chainerrl.wrappers import atari_wrappers


class A3CFF(chainer.ChainList, a3c.A3CModel):
prabhatnagarajan (Contributor Author):

I'm assuming this follows the specification of the A3C paper:

"The agents used the network architecture from (Mnih et al., 2013). The network used a convolutional layer with 16 filters of size 8 × 8 with stride 4, followed by a convolutional layer with 32 filters of size 4 × 4 with stride 2, followed by a fully connected layer with 256 hidden units. All three hidden layers were followed by a rectifier nonlinearity. The value-based methods had a single linear output unit for each action representing the action-value. The model used by actor-critic agents had two set of outputs – a softmax output with one entry per action representing the probability of selecting the action, and a single linear output representing the value function." (Source: A3C paper)
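
For reference, a minimal Chainer sketch of that architecture (assuming 4 stacked input frames as in Mnih et al., 2013; the class name is mine and this is not necessarily identical to the A3CFF model in this example):

```python
import chainer
import chainer.functions as F
import chainer.links as L


class A3CFeedForwardSketch(chainer.Chain):
    """Sketch of the quoted architecture: two conv layers and one fully
    connected layer, each followed by ReLU, with a softmax policy head
    and a linear value head."""

    def __init__(self, n_actions):
        super().__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(4, 16, ksize=8, stride=4)   # 16 filters, 8x8, stride 4
            self.conv2 = L.Convolution2D(16, 32, ksize=4, stride=2)  # 32 filters, 4x4, stride 2
            self.fc = L.Linear(None, 256)                            # 256 hidden units
            self.pi = L.Linear(256, n_actions)                       # softmax policy output
            self.v = L.Linear(256, 1)                                # linear value output

    def forward(self, x):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc(h))
        return F.softmax(self.pi(h)), self.v(h)
```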

prabhatnagarajan (Contributor Author):

Source: Noisy Nets paper - "In each case, we used the neural network architecture from the corresponding original papers for both the baseline and NoisyNet variant"

parser.add_argument('--outdir', type=str, default='results',
                    help='Directory path to save output files.'
                         ' If it does not exist, it will be created.')
parser.add_argument('--t-max', type=int, default=5)
prabhatnagarajan (Contributor Author):

Source: A3C paper, Appendix 8: "All methods performed updates after every 5 actions (tmax = 5 and IUpdate = 5) and shared RMSProp was used for optimization"

# Feature extractor
return np.asarray(x, dtype=np.float32) / 255

agent = a3c.A3C(model, opt, t_max=args.t_max, gamma=0.99,
prabhatnagarajan (Contributor Author):

Source: A3C Paper - "All experiments used a discount of γ = 0.99"

[model(fake_obs)],
os.path.join(args.outdir, 'model'))

opt = rmsprop_async.RMSpropAsync(lr=7e-4, eps=1e-1, alpha=0.99)
prabhatnagarajan (Contributor Author):

Source: A3C Paper - "and an RMSProp decay factor of α = 0.99"

parser.add_argument('--t-max', type=int, default=5)
parser.add_argument('--beta', type=float, default=1e-2)
parser.add_argument('--profile', action='store_true')
parser.add_argument('--steps', type=int, default=8 * 10 ** 7)
prabhatnagarajan (Contributor Author):

Source: Noisy nets paper - "The DQN and A3C agents were training for 200M and 320M frames, respectively." With the standard Atari frame skip of 4, 320M environment frames corresponds to 8 × 10^7 agent steps, which is the default used here.

from chainerrl.wrappers import atari_wrappers


class A3CFF(chainer.ChainList, a3c.A3CModel):
prabhatnagarajan (Contributor Author):

Source: Noisy Nets paper - "In each case, we used the neural network architecture from the corresponding original papers for both the baseline and NoisyNet variant"

parser.add_argument('--beta', type=float, default=1e-2)
parser.add_argument('--profile', action='store_true')
parser.add_argument('--steps', type=int, default=8 * 10 ** 7)
parser.add_argument('--max-frames', type=int,
prabhatnagarajan (Contributor Author):

Source: Noisy Networks paper - "Episodes are truncated at 108K frames (or 30 minutes of simulated play) (van Hasselt et al., 2016)." (At 60 frames per second, 30 minutes is 60 × 60 × 30 = 108,000 frames.) However, it's unclear from the context whether this refers to training or testing. Given the conventions of other deep RL papers, I'm assuming the truncation applies to both training and evaluation.

@prabhatnagarajan (Contributor Author) commented:

Other comments/questions:

  • I'm unclear on the original A3C paper's exploration strategy.
  • We used 16 processes (as per the A3C paper) and no GPUs. However, we ran on 17 CPUs.
  • I'm assuming that the Noisy Networks paper kept all other details as in the original A3C paper (which is why I cite both the Noisy Nets paper and the A3C paper as sources).
  • The Noisy Networks paper evaluates intermediate networks rather than only the final network. Since the Noisy Nets paper has the results we're comparing against, I use its evaluation protocol.
  • However, the evaluation protocol isn't 100% clear in the Noisy Nets paper: "Per-game maximum scores are computed by taking the maximum raw scores of the agent and then averaging over three seeds. However, for computing the human normalised scores in Figure 2, the raw scores are evaluated every 1M frames and averaged over three seeds." The wording implies the two procedures differ, which confuses me, because it seems to me that in both evaluations they evaluate every 1M frames and take the maximum performance. (My reading of the per-game maximum protocol is sketched below.)
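
For concreteness, here is a minimal sketch (the function name and data layout are mine, not from the paper or this PR) of the per-game maximum protocol as I read it:

```python
import numpy as np


def per_game_max_score(eval_scores_per_seed):
    """My reading of the 'per-game maximum' protocol: take the best
    intermediate evaluation score within each seed (evaluations run every
    1M frames), then average those per-seed maxima over the seeds."""
    return float(np.mean([np.max(scores) for scores in eval_scores_per_seed]))


# Made-up evaluation curves for three seeds:
per_game_max_score([np.array([100.0, 250.0, 240.0]),
                    np.array([90.0, 210.0, 300.0]),
                    np.array([120.0, 260.0, 255.0])])  # -> 270.0
```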

@prabhatnagarajan changed the title from "[WIP] A3C Example for reproducing paper results." to "A3C Example for reproducing paper results." on May 20, 2019
@muupan (Member) commented May 20, 2019:

> I'm unclear on the original A3C paper's exploration strategy.

A3C is on-policy, so I don't think they used any specific exploration strategy besides sampling from a current policy.

@muupan (Member) commented May 20, 2019:

> We used 16 processes (as per the A3C paper) and no GPUs. However, we ran on 17 CPUs.

Can you note this in the README as well?

@prabhatnagarajan (Contributor Author) commented May 21, 2019:

> A3C is on-policy, so I don't think they used any specific exploration strategy besides sampling from a current policy.

Quote from the paper: "The value based methods sampled the exploration rate ε from a distribution taking three values ε1, ε2, ε3 with probabilities 0.4, 0.3, 0.3. The values of ε1, ε2, ε3 were annealed from 1 to 0.1, 0.01, 0.5 respectively over the first four million frames. Advantage actor-critic used entropy regularization with a weight β = 0.01 for all Atari and TORCS experiments." It seems you're right: the epsilon-greedy exploration applies only to the value-based baselines.
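
For reference, here is a schematic single-step A3C actor loss showing where the β-weighted entropy bonus (the --beta argument above, default 1e-2) enters; this is my own illustration, not ChainerRL's a3c.A3C implementation:

```python
import numpy as np


def a3c_actor_loss(pi, action, advantage, beta=1e-2):
    """Schematic single-step actor loss: negative policy-gradient objective
    minus a beta-weighted entropy bonus that discourages premature convergence
    to a deterministic policy. `pi` is the action distribution from the policy head."""
    entropy = -np.sum(pi * np.log(pi))   # H(pi(.|s))
    log_pi_a = np.log(pi[action])        # log-probability of the action taken
    # A3C maximizes log_pi_a * advantage + beta * entropy; return its negation as a loss.
    return -(log_pi_a * advantage + beta * entropy)


# Example: uniform policy over 4 actions, action 2 taken, advantage 1.5
a3c_actor_loss(np.array([0.25, 0.25, 0.25, 0.25]), 2, 1.5)
```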

@muupan (Member) commented Jun 5, 2019:

Confirmed that the CI failure is caused by #463.

@muupan merged commit e3e2093 into chainer:master on Jun 5, 2019
@prabhatnagarajan deleted the a3c_example branch on Jun 7, 2019
@muupan added this to the v0.7 milestone on Jun 28, 2019
@muupan added the example label on Jun 28, 2019