Advantage Actor Critic (A2C) Model (#598)
* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
6 people authored Aug 13, 2021
1 parent 2d7ae88 commit bd28835
Showing 11 changed files with 617 additions and 16 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -4,11 +4,12 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [unReleased] - 2021-MM-DD

### Added

- Added Advantage Actor-Critic (A2C) Model ([#598](https://github.com/PyTorchLightning/lightning-bolts/pull/598))


### Changed

112 changes: 105 additions & 7 deletions docs/source/reinforce_learn.rst
@@ -25,6 +25,7 @@ Contributions by: `Donal Byrne <https://github.com/djbyrne>`_
RL models currently only support CPU and single GPU training with `distributed_backend=dp`.
Full GPU support will be added in later updates.

------------

DQN Models
----------
@@ -86,7 +87,7 @@ Example::
trainer = Trainer()
trainer.fit(dqn)

.. autoclass:: pl_bolts.models.rl.dqn_model.DQN
.. autoclass:: pl_bolts.models.rl.DQN
:noindex:

---------------
@@ -150,7 +151,7 @@ Example::
trainer = Trainer()
trainer.fit(ddqn)

.. autoclass:: pl_bolts.models.rl.double_dqn_model.DoubleDQN
.. autoclass:: pl_bolts.models.rl.DoubleDQN
:noindex:

---------------
@@ -240,7 +241,7 @@ Example::
trainer = Trainer()
trainer.fit(dueling_dqn)

.. autoclass:: pl_bolts.models.rl.dueling_dqn_model.DuelingDQN
.. autoclass:: pl_bolts.models.rl.DuelingDQN
:noindex:

--------------
@@ -326,7 +327,7 @@ Example::
trainer = Trainer()
trainer.fit(noisy_dqn)

.. autoclass:: pl_bolts.models.rl.noisy_dqn_model.NoisyDQN
.. autoclass:: pl_bolts.models.rl.NoisyDQN
:noindex:

--------------
@@ -519,7 +520,7 @@ Example::
trainer = Trainer()
trainer.fit(per_dqn)

.. autoclass:: pl_bolts.models.rl.per_dqn_model.PERDQN
.. autoclass:: pl_bolts.models.rl.PERDQN
:noindex:


@@ -611,7 +612,7 @@ Example::
trainer = Trainer()
trainer.fit(reinforce)

.. autoclass:: pl_bolts.models.rl.reinforce_model.Reinforce
.. autoclass:: pl_bolts.models.rl.Reinforce
:noindex:

--------------
@@ -664,5 +665,102 @@ Example::
trainer = Trainer()
trainer.fit(vpg)

.. autoclass:: pl_bolts.models.rl.vanilla_policy_gradient_model.VanillaPolicyGradient
.. autoclass:: pl_bolts.models.rl.VanillaPolicyGradient
:noindex:

--------------

Actor-Critic Models
-------------------
The following models are based on Actor-Critic. Actor-Critic combines the approaches of value-based learning (the DQN family)
and policy-based learning (the PG family) by learning both a value function and a policy distribution. This approach
updates the policy network according to the policy gradient, and updates the value network to fit the discounted returns.

Actor-Critic Key Points:
- The actor outputs a distribution over actions for controlling the agent
- The critic outputs the value of the current state, which is used to guide policy updates
- The critic allows the model to train on n-step rollouts instead of generating an entire trajectory

Advantage Actor Critic (A2C)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Asynchronous) Advantage Actor Critic model introduced in `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Paper authors: Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu

Original implementation by: `Jason Wang <https://github.com/blahBlahhhJ>`_

Advantage Actor Critic (A2C) is the classical actor-critic approach in reinforcement learning. The underlying neural
network has an actor head and a critic head, outputting the action distribution as well as the value of the current state.
Usually the first few layers are shared by the two heads so that similar features are not learned twice. The method builds
on VPG's idea of subtracting a baseline (the average reward) to reduce variance, but uses the critic's value estimate as
the baseline instead, which can theoretically perform better.
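
A minimal sketch of such a shared-trunk network is shown below; the class name, layer sizes, and structure are illustrative assumptions rather than the Bolts implementation::

    import torch
    from torch import nn

    class ActorCriticNet(nn.Module):
        """Shared trunk with separate actor and critic heads (illustrative sketch)."""

        def __init__(self, state_dim: int, n_actions: int, hidden_size: int = 128):
            super().__init__()
            # layers shared by both heads, so common features are learned only once
            self.shared = nn.Sequential(nn.Linear(state_dim, hidden_size), nn.ReLU())
            # actor head: logits over the discrete actions
            self.actor_head = nn.Linear(hidden_size, n_actions)
            # critic head: a single scalar estimate of V(s)
            self.critic_head = nn.Linear(hidden_size, 1)

        def forward(self, state: torch.Tensor):
            x = self.shared(state)
            logits = self.actor_head(x)              # parameters of the action distribution
            value = self.critic_head(x).squeeze(-1)  # state value V(s)
            return logits, value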

The algorithm can use an n-step training approach instead of generating an entire trajectory; it proceeds as follows (a loss-computation sketch in code follows the list):

1. Initialize our network.
2. Roll out n steps and save the transitions (states, actions, rewards, values, dones).
3. Calculate the n-step (discounted) return by bootstrapping from the last value.

   .. math::

      G_{n+1} = V_{n+1}, \quad G_t = r_t + \gamma G_{t+1} \quad \forall t \in [0, n]

4. Calculate the actor loss, using the critic's values as a baseline.

   .. math::

      L_{actor} = - \frac1n \sum_t (G_t - V_t) \log \pi (a_t | s_t)

5. Calculate the critic loss, using the returns as targets.

   .. math::

      L_{critic} = \frac1n \sum_t (V_t - G_t)^2

6. Calculate an entropy bonus to encourage exploration.

   .. math::

      H_\pi = - \frac1n \sum_t \pi (a_t | s_t) \log \pi (a_t | s_t)

7. Calculate the total loss as a weighted sum of the three components above.

   .. math::

      L = L_{actor} + \beta_{critic} L_{critic} - \beta_{entropy} H_\pi

8. Perform gradient descent to update our network.
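
The sketch below shows how these quantities might be computed for one n-step batch; the helper names and default coefficients are illustrative assumptions, not the Bolts implementation::

    import torch
    from torch.distributions import Categorical

    def discounted_returns(rewards, dones, last_value, gamma=0.99):
        """Step 3: bootstrap the n-step discounted returns from the critic's last value.

        `rewards` and `dones` are Python lists for the n collected steps,
        `last_value` is the critic's estimate V(s_{n+1}) as a float.
        """
        g = last_value  # G_{n+1} = V_{n+1}
        returns = []
        for reward, done in zip(reversed(rewards), reversed(dones)):
            g = reward + gamma * g * (1 - done)  # reset the return at episode boundaries
            returns.append(g)
        return torch.tensor(list(reversed(returns)), dtype=torch.float32)

    def a2c_loss(logits, values, returns, actions, critic_beta=0.5, entropy_beta=0.01):
        """Steps 4-7: combine the actor, critic, and entropy terms into the total loss."""
        dist = Categorical(logits=logits)
        log_probs = dist.log_prob(actions)

        advantages = returns - values.detach()          # G_t - V_t, critic treated as a fixed baseline
        actor_loss = -(advantages * log_probs).mean()   # policy-gradient term
        critic_loss = (values - returns).pow(2).mean()  # fit V_t to the n-step return
        entropy = dist.entropy().mean()                 # exploration bonus

        return actor_loss + critic_beta * critic_loss - entropy_beta * entropy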

.. note::
    The current implementation only supports discrete action spaces, and has only been tested on the CartPole environment.

A2C Benefits
~~~~~~~~~~~~~~~

- Combines the benefits of value-based learning and policy-based learning

- Further reduces variance by using the critic as a value estimator

A2C Results
~~~~~~~~~~~~~~~~

Hyperparameters:

- Batch Size: 32
- Learning Rate: 0.001
- Entropy Beta: 0.01
- Critic Beta: 0.5
- Gamma: 0.99
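
Assuming these settings map onto constructor arguments of the same names (an assumption for illustration; see the class reference at the end of this section for the actual signature), they could be applied roughly as follows::

    from pl_bolts.models.rl import AdvantageActorCritic

    # argument names below are assumed for illustration -- check the actual class signature
    a2c = AdvantageActorCritic(
        "CartPole-v0",
        batch_size=32,
        lr=0.001,
        entropy_beta=0.01,
        critic_beta=0.5,
        gamma=0.99,
    )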

.. image:: _images/rl_benchmark/cartpole_a2c_results.jpg
:width: 300
:alt: A2C Results

Example::

    from pl_bolts.models.rl import AdvantageActorCritic
    from pytorch_lightning import Trainer

    a2c = AdvantageActorCritic("CartPole-v0")
    trainer = Trainer()
    trainer.fit(a2c)

.. autoclass:: pl_bolts.models.rl.AdvantageActorCritic
:noindex:
16 changes: 9 additions & 7 deletions pl_bolts/models/rl/__init__.py
@@ -1,12 +1,14 @@
from pl_bolts.models.rl.double_dqn_model import DoubleDQN # noqa: F401
from pl_bolts.models.rl.dqn_model import DQN # noqa: F401
from pl_bolts.models.rl.dueling_dqn_model import DuelingDQN # noqa: F401
from pl_bolts.models.rl.noisy_dqn_model import NoisyDQN # noqa: F401
from pl_bolts.models.rl.per_dqn_model import PERDQN # noqa: F401
from pl_bolts.models.rl.reinforce_model import Reinforce # noqa: F401
from pl_bolts.models.rl.vanilla_policy_gradient_model import VanillaPolicyGradient # noqa: F401
from pl_bolts.models.rl.advantage_actor_critic_model import AdvantageActorCritic
from pl_bolts.models.rl.double_dqn_model import DoubleDQN
from pl_bolts.models.rl.dqn_model import DQN
from pl_bolts.models.rl.dueling_dqn_model import DuelingDQN
from pl_bolts.models.rl.noisy_dqn_model import NoisyDQN
from pl_bolts.models.rl.per_dqn_model import PERDQN
from pl_bolts.models.rl.reinforce_model import Reinforce
from pl_bolts.models.rl.vanilla_policy_gradient_model import VanillaPolicyGradient

__all__ = [
"AdvantageActorCritic",
"DoubleDQN",
"DQN",
"DuelingDQN",
