DQN
- DQN
- Double Q-networks (see the sketch below)
- Epsilon decay
Masters CartPole after only 92 episodes?
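A minimal sketch of the double Q-network target and a linear epsilon decay schedule, assuming PyTorch. The names `online_net`, `target_net`, and the batch fields are hypothetical, not necessarily the repo's actual API:

```python
import torch

def double_dqn_targets(online_net, target_net, batch, gamma=0.99):
    """Double-DQN target: the online net picks the next action,
    the target net scores it (reduces overestimation bias)."""
    with torch.no_grad():
        next_actions = online_net(batch.next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(batch.next_states).gather(1, next_actions).squeeze(1)
        return batch.rewards + gamma * next_q * (1.0 - batch.dones)

def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linear epsilon decay for epsilon-greedy exploration."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```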
A2C
- A2C
- GAE
- N-Step Returns (with GAE; see the sketch below)
Note that A2C is much less sample-efficient than DQN and the current SOTA methods (PPO, TD3, SAC).
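A minimal sketch of GAE over an n-step rollout, assuming NumPy inputs as float32 arrays; the function name and argument layout are my own, not necessarily the repo's:

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over an n-step rollout.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}   (zeroed at episode ends)
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * next_adv * nonterminal
        advantages[t] = next_adv
        next_value = values[t]
    returns = advantages + values  # n-step returns used as value targets
    return advantages, returns
```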
PPO
...did I just successfully implement PPO?
- PPO (basically A2C with a few extra steps)
- GAE
- N-Step Returns (with GAE)
- Mini-batch learning
- Multiple learning iterations per batch (see the sketch below)
As sample-efficient as DQN? (minus the replay memory)
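A minimal sketch of the clipped PPO update with mini-batches and multiple epochs per rollout, assuming PyTorch. The `policy(obs, actions) -> (log_probs, values)` API and the hyperparameter defaults are hypothetical stand-ins; `old_log_probs` is assumed to be detached from the graph:

```python
import torch

def ppo_update(policy, optimizer, obs, actions, old_log_probs,
               advantages, returns, clip_eps=0.2, epochs=4,
               minibatch_size=64, vf_coef=0.5):
    """Clipped PPO objective: several epochs of mini-batch steps per rollout."""
    n = obs.shape[0]
    for _ in range(epochs):  # multiple learning iterations per batch
        for idx in torch.randperm(n).split(minibatch_size):  # mini-batch learning
            log_probs, values = policy(obs[idx], actions[idx])  # assumed API
            ratio = (log_probs - old_log_probs[idx]).exp()
            adv = advantages[idx]
            # Take the pessimistic (clipped) surrogate objective
            unclipped = ratio * adv
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
            policy_loss = -torch.min(unclipped, clipped).mean()
            value_loss = (values - returns[idx]).pow(2).mean()
            loss = policy_loss + vf_coef * value_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```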