Code written by Galen Cho (Woon Sang Cho): https://github.com/woonsangcho
This is an implementation of Proximal Policy Optimization (PPO)[1][2], which is a variant of Trust Region Policy Optimization (TRPO)[3].
This is one version that resulted from experimenting with a number of variants, in particular different loss functions, advantage estimators [4], normalization schemes, and a few other tricks from the reference papers. The code is clearly commented so that it is easy to follow, with page numbers from the references wherever helpful.
- The parametrized value function is updated using the *previous* batch of trajectories, whereas the parametrized policy and the advantages are computed from the *current* batch, as suggested in [5] to avoid overfitting. Please see [5] for the core training scheme; the update order is also sketched just after this list.
- The network architecture for both the value and the policy network is fixed at two hidden layers of 128 nodes each with `tanh` activation. The kernels in the value network are l2 regularized. A sketch of the two networks follows this list.
- As an attempt to constrain the output space of the value network, I normalized each reward, keeping running statistics, so that most of the mass of the discounted sum of rewards lies roughly on the boundary of a unit ball, i.e. unit length in the l_0 norm of the discounted rewards. However, this mapped states to values that earlier states had been mapped to in previous iterations, which slowed the learning curve.
- The kernels in the layers are initialized with `RandomNormal(mean=0, std=0.1)`, a heuristic I have found useful: the outputs for the mean action are then centered around 0 during the initial stage of policy learning. If this center is far from 0, sampled actions fluctuate widely at random and learning takes longer.
- While it is noted in [3] that a *separate set of parameters specifies the log standard deviation of each element*, I also experimented with a merged network outputting both `mean` and `sigma`, and with two separate networks for each; both performed poorly, so their source is omitted from this repository.
- The value function is trained using the built-in `fit` routine in Keras for convenient epoch and batch-size management, while the policy is trained using TensorFlow over the entire batch of trajectories. You may modify the source to truncate the episodic rollout to a fixed horizon size `T`; the suggested size is `T=2048`, as noted in the reference.
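For concreteness, below is a minimal sketch of the network shapes described above, written against the Keras 2 / TensorFlow 1.x APIs listed in the requirements. The function names, the `obs_dim`/`act_dim` arguments, and the l2 coefficient are illustrative assumptions, not names taken from this repository.

```python
# Illustrative sketch only: two hidden layers of 128 tanh units, l2-regularized
# kernels in the value network, and RandomNormal(mean=0, std=0.1) initialization.
# Function names and the l2 coefficient are hypothetical, not the repository's own.
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.initializers import RandomNormal
from keras.regularizers import l2

init = RandomNormal(mean=0.0, stddev=0.1)

def build_value_network(obs_dim, l2_coeff=1e-3):
    # State-value network: scalar output, l2 regularization on every kernel.
    return Sequential([
        Dense(128, activation='tanh', kernel_initializer=init,
              kernel_regularizer=l2(l2_coeff), input_shape=(obs_dim,)),
        Dense(128, activation='tanh', kernel_initializer=init,
              kernel_regularizer=l2(l2_coeff)),
        Dense(1, kernel_initializer=init, kernel_regularizer=l2(l2_coeff)),
    ])

def build_policy_mean_network(obs_dim, act_dim):
    # Policy network outputs the mean action; the log standard deviation is a
    # separate trainable parameter per action dimension, as suggested in [3].
    model = Sequential([
        Dense(128, activation='tanh', kernel_initializer=init, input_shape=(obs_dim,)),
        Dense(128, activation='tanh', kernel_initializer=init),
        Dense(act_dim, kernel_initializer=init),
    ])
    log_std = tf.Variable(tf.zeros([act_dim]), name='log_std')
    return model, log_std
```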
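Similarly, here is a minimal sketch of the batch scheduling from [5] described in the first bullet: the value network is fit on the previous batch with the Keras `fit` routine, while the policy step uses the current batch. `env_sampler` and `policy_update` are hypothetical stand-ins for the repository's own sampling and TensorFlow policy-update code, and the simple return-minus-baseline advantage is used only for brevity; the repository uses GAE [4] (sketched after the hyper-parameters below).

```python
# Sketch of the update order described above: the policy and advantages use the
# current batch, while the value function is fit on the previous batch [5].
# `env_sampler`, `policy_update`, and `value_model` are hypothetical stand-ins.
def run_training(env_sampler, policy_update, value_model,
                 n_iterations, n_value_epochs=15, value_batch_size=512):
    prev_obs, prev_targets = None, None
    for _ in range(n_iterations):
        obs, actions, disc_rewards = env_sampler()            # current batch of trajectories
        baselines = value_model.predict(obs).squeeze(axis=-1)
        advantages = disc_rewards - baselines                 # simplified; the repo uses GAE [4]
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        policy_update(obs, actions, advantages)               # policy step on the current batch
        if prev_obs is not None:
            # Value network is fit on the *previous* batch to avoid overfitting [5].
            value_model.fit(prev_obs, prev_targets, epochs=n_value_epochs,
                            batch_size=value_batch_size, verbose=0)
        prev_obs, prev_targets = obs, disc_rewards            # becomes the previous batch next time
```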
- tensorflow (1.4.0)
- keras (2.0.9)
- numpy
- scipy
- openai gym (0.9.4)
- MuJoCo
These hyper-parameters follow the references, with some further changes; please see my comments in the source for details. Two short sketches after the list below illustrate how the discount and KL-penalty settings are typically used.
```
policy_learning_rate = 1 * 1e-04
value_learning_rate = 1.5 * 1e-03
n_policy_epochs = 20
n_value_epochs = 15
value_batch_size = 512
kl_target = 0.003
beta = 1
beta_max = 20
beta_min = 1/20
ksi = 10
reward_discount = 0.995
gae_discount = 0.975
traj_batch_size = 10  # number of episodes to collect per training iteration
activation = 'tanh'
```
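For reference, here is one common way to compute the generalized advantage estimates [4] that `reward_discount` (gamma) and `gae_discount` (lambda) parameterize, using the `scipy` dependency listed above. The helper names are illustrative; this is a sketch of the standard formula, not code lifted from the repository.

```python
# A minimal sketch of generalized advantage estimation [4] with the discounts
# above (reward_discount as gamma, gae_discount as lambda).
import numpy as np
from scipy.signal import lfilter

def discounted_cumsum(x, discount):
    # Reverse-time discounted cumulative sum: y[t] = x[t] + discount * y[t+1].
    return lfilter([1.0], [1.0, -discount], x[::-1])[::-1]

def gae_advantages(rewards, values, gamma=0.995, lam=0.975):
    # `values` has one extra entry for the state after the last reward
    # (0 for a terminal state, or a bootstrap value for a truncated rollout).
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals
    return discounted_cumsum(deltas, gamma * lam)

rewards = np.array([1.0, 1.0, 1.0, 0.0])
values = np.array([0.5, 0.6, 0.4, 0.2, 0.0])
print(gae_advantages(rewards, values))
```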
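The `kl_target`, `beta`, `beta_max`, `beta_min`, and `ksi` settings point to the adaptive KL-penalty form of PPO from [1] and [2]. Below is a hedged sketch of that objective and a beta adaptation rule; the thresholds, the adaptation factor, and in particular the squared-hinge use of `ksi` are assumptions on my part, so please check the comments in the source for the exact form used here.

```python
# Sketch of a KL-penalized surrogate objective (TensorFlow 1.x) and an
# adaptive-beta rule in the spirit of [1] and [2]. The thresholds, the
# adaptation factor of 1.5, and the squared-hinge term weighted by `ksi`
# are assumptions, not values read from this repository's source.
import tensorflow as tf

def surrogate_loss(logp, logp_old, advantages, kl, beta, ksi=10.0, kl_target=0.003):
    ratio = tf.exp(logp - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    surrogate = tf.reduce_mean(ratio * advantages)  # importance-weighted advantage
    penalty = beta * kl                             # adaptive KL penalty
    # Assumed extra penalty when the mean KL overshoots twice the target.
    hinge = ksi * tf.square(tf.maximum(0.0, kl - 2.0 * kl_target))
    return -(surrogate - penalty - hinge)           # minimized by the optimizer

def adapt_beta(beta, observed_kl, kl_target=0.003, beta_min=1.0 / 20, beta_max=20.0):
    # Tighten the penalty when the policy moved too far, relax it when it barely moved.
    if observed_kl > 2.0 * kl_target:
        beta = min(beta * 1.5, beta_max)
    elif observed_kl < 0.5 * kl_target:
        beta = max(beta / 1.5, beta_min)
    return beta
```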
```
python main.py --environ-string='InvertedPendulum-v1' --max-episode-count=1000
python main.py --environ-string='InvertedDoublePendulum-v1' --max-episode-count=15000
python main.py --environ-string='Hopper-v1' --max-episode-count=20000
python main.py --environ-string='HalfCheetah-v1' --max-episode-count=20000
python main.py --environ-string='Swimmer-v1' --max-episode-count=5000
python main.py --environ-string='Ant-v1' --max-episode-count=150000
python main.py --environ-string='Reacher-v1' --max-episode-count=50000
python main.py --environ-string='Walker2d-v1' --max-episode-count=30000
python main.py --environ-string='Humanoid-v1' --max-episode-count=150000
python main.py --environ-string='HumanoidStandup-v1' --max-episode-count=150000
```
The default seed input is `1989`. You can append `--seed=<value>` to experiment with different seeds.
1. Proximal Policy Optimization Algorithms (Schulman et al., 2017)
2. Emergence of Locomotion Behaviours in Rich Environments (Heess et al., 2017)
3. Trust Region Policy Optimization (Schulman et al., 2015)
4. High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2015)
5. Towards Generalization and Simplicity in Continuous Control (Rajeswaran et al., 2017)
6. Repository 1 for helpful implementation pointers (Schulman)
7. Repository 2 for helpful implementation pointers (Coady)