Advantage Actor Critic (A2C) Model (#598)
* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
6 people authored Aug 13, 2021
1 parent 2d7ae88 commit bd28835
Showing 11 changed files with 617 additions and 16 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -4,11 +4,12 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [unReleased] - 2021-MM-DD

### Added

- Added Advantage Actor-Critic (A2C) Model ([#598](https://github.com/PyTorchLightning/lightning-bolts/pull/598))


### Changed

112 changes: 105 additions & 7 deletions docs/source/reinforce_learn.rst
@@ -25,6 +25,7 @@ Contributions by: `Donal Byrne <https://github.com/djbyrne>`_
RL models currently only support CPU and single GPU training with `distributed_backend=dp`.
Full GPU support will be added in later updates.

------------

DQN Models
----------
@@ -86,7 +87,7 @@ Example::
trainer = Trainer()
trainer.fit(dqn)

.. autoclass:: pl_bolts.models.rl.dqn_model.DQN
.. autoclass:: pl_bolts.models.rl.DQN
:noindex:

---------------
@@ -150,7 +151,7 @@ Example::
trainer = Trainer()
trainer.fit(ddqn)

.. autoclass:: pl_bolts.models.rl.double_dqn_model.DoubleDQN
.. autoclass:: pl_bolts.models.rl.DoubleDQN
:noindex:

---------------
@@ -240,7 +241,7 @@ Example::
trainer = Trainer()
trainer.fit(dueling_dqn)

.. autoclass:: pl_bolts.models.rl.dueling_dqn_model.DuelingDQN
.. autoclass:: pl_bolts.models.rl.DuelingDQN
:noindex:

--------------
@@ -326,7 +327,7 @@ Example::
trainer = Trainer()
trainer.fit(noisy_dqn)

.. autoclass:: pl_bolts.models.rl.noisy_dqn_model.NoisyDQN
.. autoclass:: pl_bolts.models.rl.NoisyDQN
:noindex:

--------------
@@ -519,7 +520,7 @@ Example::
trainer = Trainer()
trainer.fit(per_dqn)

.. autoclass:: pl_bolts.models.rl.per_dqn_model.PERDQN
.. autoclass:: pl_bolts.models.rl.PERDQN
:noindex:


@@ -611,7 +612,7 @@ Example::
trainer = Trainer()
trainer.fit(reinforce)

.. autoclass:: pl_bolts.models.rl.reinforce_model.Reinforce
.. autoclass:: pl_bolts.models.rl.Reinforce
:noindex:

--------------
@@ -664,5 +665,102 @@ Example::
trainer = Trainer()
trainer.fit(vpg)

.. autoclass:: pl_bolts.models.rl.vanilla_policy_gradient_model.VanillaPolicyGradient
.. autoclass:: pl_bolts.models.rl.VanillaPolicyGradient
:noindex:

--------------

Actor-Critic Models
-------------------
The following models are based on Actor-Critic. Actor-Critic combines the approaches of value-based learning (the DQN family)
and policy-based learning (the PG family) by learning both a value function and a policy distribution. This approach
updates the policy network according to the policy gradient, and updates the value network to fit the discounted returns.

Actor-Critic Key Points:
- The actor outputs a distribution over actions for controlling the agent
- The critic outputs the value of the current state, which is used to guide policy updates
- The critic allows the model to train on n-step rollouts instead of generating an entire trajectory

Advantage Actor Critic (A2C)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Asynchronous) Advantage Actor Critic model introduced in `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Paper authors: Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu

Original implementation by: `Jason Wang <https://github.com/blahBlahhhJ>`_

Advantage Actor Critic (A2C) is the classical actor-critic approach in reinforcement learning. The underlying neural
network has an actor head and a critic head, outputting the action distribution as well as the value of the current state.
Usually the first few layers are shared by the two heads so that similar features are not learned twice. The method builds
on VPG's idea of subtracting a baseline (the average reward) to reduce variance, but uses the critic's value estimate as
the baseline instead, which can theoretically perform better.
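
A minimal sketch of such a shared-trunk network is shown below; the class name, layer sizes, and structure are illustrative assumptions rather than the Bolts implementation::

    import torch
    from torch import nn

    class ActorCriticNet(nn.Module):
        """Shared trunk with separate actor and critic heads (illustrative sketch)."""

        def __init__(self, state_dim: int, n_actions: int, hidden_size: int = 128):
            super().__init__()
            # layers shared by both heads, so common features are learned only once
            self.shared = nn.Sequential(nn.Linear(state_dim, hidden_size), nn.ReLU())
            # actor head: logits over the discrete actions
            self.actor_head = nn.Linear(hidden_size, n_actions)
            # critic head: a single scalar estimate of V(s)
            self.critic_head = nn.Linear(hidden_size, 1)

        def forward(self, state: torch.Tensor):
            x = self.shared(state)
            logits = self.actor_head(x)              # parameters of the action distribution
            value = self.critic_head(x).squeeze(-1)  # state value V(s)
            return logits, value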

The algorithm can use an n-step training approach instead of generating an entire trajectory; it proceeds as follows (a loss-computation sketch in code follows the list):

1. Initialize our network.
2. Roll out n steps and save the transitions (states, actions, rewards, values, dones).
3. Calculate the n-step (discounted) return by bootstrapping from the last value.

   .. math::

      G_{n+1} = V_{n+1}, \quad G_t = r_t + \gamma G_{t+1} \quad \forall t \in [0, n]

4. Calculate the actor loss, using the critic's values as a baseline.

   .. math::

      L_{actor} = - \frac1n \sum_t (G_t - V_t) \log \pi (a_t | s_t)

5. Calculate the critic loss, using the returns as targets.

   .. math::

      L_{critic} = \frac1n \sum_t (V_t - G_t)^2

6. Calculate an entropy bonus to encourage exploration.

   .. math::

      H_\pi = - \frac1n \sum_t \pi (a_t | s_t) \log \pi (a_t | s_t)

7. Calculate the total loss as a weighted sum of the three components above.

   .. math::

      L = L_{actor} + \beta_{critic} L_{critic} - \beta_{entropy} H_\pi

8. Perform gradient descent to update our network.
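
The sketch below shows how these quantities might be computed for one n-step batch; the helper names and default coefficients are illustrative assumptions, not the Bolts implementation::

    import torch
    from torch.distributions import Categorical

    def discounted_returns(rewards, dones, last_value, gamma=0.99):
        """Step 3: bootstrap the n-step discounted returns from the critic's last value.

        `rewards` and `dones` are Python lists for the n collected steps,
        `last_value` is the critic's estimate V(s_{n+1}) as a float.
        """
        g = last_value  # G_{n+1} = V_{n+1}
        returns = []
        for reward, done in zip(reversed(rewards), reversed(dones)):
            g = reward + gamma * g * (1 - done)  # reset the return at episode boundaries
            returns.append(g)
        return torch.tensor(list(reversed(returns)), dtype=torch.float32)

    def a2c_loss(logits, values, returns, actions, critic_beta=0.5, entropy_beta=0.01):
        """Steps 4-7: combine the actor, critic, and entropy terms into the total loss."""
        dist = Categorical(logits=logits)
        log_probs = dist.log_prob(actions)

        advantages = returns - values.detach()          # G_t - V_t, critic treated as a fixed baseline
        actor_loss = -(advantages * log_probs).mean()   # policy-gradient term
        critic_loss = (values - returns).pow(2).mean()  # fit V_t to the n-step return
        entropy = dist.entropy().mean()                 # exploration bonus

        return actor_loss + critic_beta * critic_loss - entropy_beta * entropy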

.. note::
    The current implementation only supports discrete action spaces, and has only been tested on the CartPole environment.

A2C Benefits
~~~~~~~~~~~~~~~

- Combines the benefits of value-based learning and policy-based learning

- Further reduces variance by using the critic as a value estimator

A2C Results
~~~~~~~~~~~~~~~~

Hyperparameters:

- Batch Size: 32
- Learning Rate: 0.001
- Entropy Beta: 0.01
- Critic Beta: 0.5
- Gamma: 0.99
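
Assuming these settings map onto constructor arguments of the same names (an assumption for illustration; see the class reference at the end of this section for the actual signature), they could be applied roughly as follows::

    from pl_bolts.models.rl import AdvantageActorCritic

    # argument names below are assumed for illustration -- check the actual class signature
    a2c = AdvantageActorCritic(
        "CartPole-v0",
        batch_size=32,
        lr=0.001,
        entropy_beta=0.01,
        critic_beta=0.5,
        gamma=0.99,
    )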

.. image:: _images/rl_benchmark/cartpole_a2c_results.jpg
:width: 300
:alt: A2C Results

Example::

    from pl_bolts.models.rl import AdvantageActorCritic
    from pytorch_lightning import Trainer

    a2c = AdvantageActorCritic("CartPole-v0")
    trainer = Trainer()
    trainer.fit(a2c)

.. autoclass:: pl_bolts.models.rl.AdvantageActorCritic
:noindex:
16 changes: 9 additions & 7 deletions pl_bolts/models/rl/__init__.py
@@ -1,12 +1,14 @@
from pl_bolts.models.rl.double_dqn_model import DoubleDQN # noqa: F401
from pl_bolts.models.rl.dqn_model import DQN # noqa: F401
from pl_bolts.models.rl.dueling_dqn_model import DuelingDQN # noqa: F401
from pl_bolts.models.rl.noisy_dqn_model import NoisyDQN # noqa: F401
from pl_bolts.models.rl.per_dqn_model import PERDQN # noqa: F401
from pl_bolts.models.rl.reinforce_model import Reinforce # noqa: F401
from pl_bolts.models.rl.vanilla_policy_gradient_model import VanillaPolicyGradient # noqa: F401
from pl_bolts.models.rl.advantage_actor_critic_model import AdvantageActorCritic
from pl_bolts.models.rl.double_dqn_model import DoubleDQN
from pl_bolts.models.rl.dqn_model import DQN
from pl_bolts.models.rl.dueling_dqn_model import DuelingDQN
from pl_bolts.models.rl.noisy_dqn_model import NoisyDQN
from pl_bolts.models.rl.per_dqn_model import PERDQN
from pl_bolts.models.rl.reinforce_model import Reinforce
from pl_bolts.models.rl.vanilla_policy_gradient_model import VanillaPolicyGradient

__all__ = [
"AdvantageActorCritic",
"DoubleDQN",
"DQN",
"DuelingDQN",
