DQN
- DQN
- Double Q-networks (see the sketch below)
- Epsilon decay
Masters CartPole after only 92 episodes?
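A minimal sketch of the double Q-network target and a linear epsilon decay schedule, assuming PyTorch. The names `online_net`, `target_net`, and the batch fields are hypothetical, not necessarily the repo's actual API:

```python
import torch

def double_dqn_targets(online_net, target_net, batch, gamma=0.99):
    """Double-DQN target: the online net picks the next action,
    the target net scores it (reduces overestimation bias)."""
    with torch.no_grad():
        next_actions = online_net(batch.next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(batch.next_states).gather(1, next_actions).squeeze(1)
        return batch.rewards + gamma * next_q * (1.0 - batch.dones)

def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linear epsilon decay for epsilon-greedy exploration."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```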
A2C
- A2C
- GAE
- N-Step Returns (with GAE; see the sketch below)
Note that A2C is much less sample-efficient than DQN and the current SOTA methods (PPO, TD3, SAC).
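A minimal sketch of GAE over an n-step rollout, assuming NumPy inputs as float32 arrays; the function name and argument layout are my own, not necessarily the repo's:

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over an n-step rollout.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}   (zeroed at episode ends)
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * next_adv * nonterminal
        advantages[t] = next_adv
        next_value = values[t]
    returns = advantages + values  # n-step returns used as value targets
    return advantages, returns
```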
PPO
...did I just successfully implement PPO?
- PPO (basically A2C with a few extra steps)
- GAE
- N-Step Returns (with GAE)
- Mini-batch learning
- Multiple learning iterations per batch (see the sketch below)
As sample-efficient as DQN? (minus the replay memory)
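A minimal sketch of the clipped PPO update with mini-batches and multiple epochs per rollout, assuming PyTorch. The `policy(obs, actions) -> (log_probs, values)` API and the hyperparameter defaults are hypothetical stand-ins; `old_log_probs` is assumed to be detached from the graph:

```python
import torch

def ppo_update(policy, optimizer, obs, actions, old_log_probs,
               advantages, returns, clip_eps=0.2, epochs=4,
               minibatch_size=64, vf_coef=0.5):
    """Clipped PPO objective: several epochs of mini-batch steps per rollout."""
    n = obs.shape[0]
    for _ in range(epochs):  # multiple learning iterations per batch
        for idx in torch.randperm(n).split(minibatch_size):  # mini-batch learning
            log_probs, values = policy(obs[idx], actions[idx])  # assumed API
            ratio = (log_probs - old_log_probs[idx]).exp()
            adv = advantages[idx]
            # Take the pessimistic (clipped) surrogate objective
            unclipped = ratio * adv
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
            policy_loss = -torch.min(unclipped, clipped).mean()
            value_loss = (values - returns[idx]).pow(2).mean()
            loss = policy_loss + vf_coef * value_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```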