PPO And Friends (PPO-AF) is an MPI-distributed PyTorch implementation of Proximal Policy Optimization, along with various extra optimizations and add-ons (friends).
We are currently compatible with the following environment frameworks:
- Gymnasium
- Gym (including versions <= 0.21)
- PettingZoo
- Abmarl Gridworld
Some of our friends:
- Decentralized Distributed Proximal Policy Optimization (DD-PPO)
- Intrinsic Curiosity Module (ICM)
- Multi-Agent Proximal Policy Optimization (MAPPO)
- Multi-Agent Transformer (MAT)
- Generalized Advantage Estimation (GAE)
- LSTM
- Gradient, reward, bootstrap, value, and observation clipping
- KL-based early ending
- KL punishment
- Observation, advantage, and reward normalization
- Advantage re-calculation
- Vectorized environments
For a full list of policy options and their defaults, see
ppo_and_friends/policies/
Note that this implementation of PPO uses separate networks for critics and actors (except for the Multi-Agent Transformer).
While you can install a barebones version of PPO-AF by simply issuing a
pip install .
command, most situations will require installing one of our
supported RL library extensions:
- gym (version 0.21.0):
pip install .[gym]
- gymnasium:
pip install .[gymnasium]
- abmarl:
pip install .[abmarl]
- pettingzoo:
pip install .[pettingzoo]
Installing the gym extension may also require downgrading the wheel package:
pip install --upgrade pip wheel==0.38.4
To train an environment, an EnvironmentRunner must first be defined. The
runner will be a class that inherits from either EnvironmentRunner or GymRunner
(located within the same module). The only method you need to define is
run, which should call self.run_ppo(...).
Make note of the following requirements:
- your environment MUST be wrapped in one of the available ppo-and-friends environment wrappers. Currently available wrappers are SingleAgentGymWrapper, MultiAgentGymWrapper, AbmarlWrapper, and ParallelZooWrapper. See Environment Wrappers for more info.
- You must add the @ppoaf_runner decorator to your class.
See the baselines directory for more examples.
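Below is a hedged sketch of what a runner file might look like. The import paths, the environment generator, and the run_ppo keyword arguments are assumptions made for illustration only; the runners in baselines/ show the actual interface.

```python
# A hedged sketch of a runner file, NOT copied from the library. Import
# paths, the env generator, and the run_ppo keyword arguments are
# assumptions; see the runner files under baselines/ for working examples.
import gymnasium as gym

from ppo_and_friends.environments.gym.wrappers import SingleAgentGymWrapper  # assumed path
from ppo_and_friends.runners.env_runner import GymRunner                     # assumed path
from ppo_and_friends.runners.runner_tags import ppoaf_runner                 # assumed path

@ppoaf_runner
class CartPoleRunner(GymRunner):

    def run(self):
        # The environment must be wrapped in one of the PPO-AF wrappers.
        env_generator = lambda : SingleAgentGymWrapper(gym.make("CartPole-v1"))

        # run_ppo kicks off training; the keyword names below are assumptions.
        self.run_ppo(env_generator  = env_generator,
                     ts_per_rollout = 1024,
                     batch_size     = 256)
```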
To train an environment, use the following command:
ppoaf train <path_to_runner_file>
Running the same command again will result in loading the previously
saved state. You can re-run from scratch by using the --clobber
option.
A complete list of options can be seen with the help
command:
ppoaf --help
PPO-AF is designed to work seamlessly with MPI. To train across multiple ranks and nodes, simply issue your MPI command followed by the PPO-AF command.
Examples:
mpirun:
mpirun -n {num_procs} ppoaf ...
srun:
srun -N1 -n {num_procs} ppoaf ...
The current implementation of multiple environment instances per
processor assumes that the rollout bottleneck will come from inference rather
than stepping through the environment. Because of this, the multiple environment
instances are run in succession rather than in parallel, and the speedup
comes from batched inference during the rollout. Very slow environments may
not see a performance gain from increasing envs_per_proc.
Examples:
mpirun:
mpirun -n {num_procs} ppoaf --envs_per_proc {envs_per_proc} ...
srun:
srun -N1 -n {num_procs} ppoaf --envs_per_proc {envs_per_proc} ...
To test a model that has been trained on a particular environment, you can issue the following command:
ppoaf test <path_to_output_directory> --num_test_runs <num_test_runs> --render
By default, exploration is enabled during testing, but you can disable it
with the --deterministic
flag. Example:
ppoaf test <path_to_output_directory> --num_test_runs <num_test_runs> --render --deterministic
The output directory will be given the same name as your runner file, and
it will appear in the path specified by --state_path when training, which
defaults to ./saved_states.
If --save_train_scores
is used while training, the results can be plotted using
PPO-AF's plotting utility.
ppoaf plot path1 path2 path3 ... <options>
Terminology varies across implementations and publications, so here are some commonly overloaded terms and how we define them.
- batch size: we refer to the gradient descent mini-batch size as the batch size. This is sometimes referred to as 'mini batch size', 'sgd mini batch size', etc. This is defined as batch_size in our code.
- timesteps per rollout: this refers to the total number of timesteps collected in a single rollout. This is sometimes referred to as the batch size. This is defined on a per-environment per-processor basis, i.e. ts_per_rollout will be internally redefined as ts_per_rollout = (num_procs * ts_per_rollout * envs_per_proc) / num_procs (see the worked example after this list).
- max timesteps per episode: this refers to the maximum number of timesteps collected for a single episode trajectory. This is sometimes referred to as horizon or trajectory length. If max timesteps per episode is 10, and we're collecting 100 timesteps in our rollout on a single processor, then we'll end up with 10 episodes of length 10. Note that the environment does not need to enter a done state for an episode's trajectory to end. This is defined as max_ts_per_ep in our code.
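As a worked example of the redefinition above (using hypothetical values): with num_procs = 2, envs_per_proc = 2, and a requested ts_per_rollout of 1024, the internal value becomes (2 * 1024 * 2) / 2 = 2048 timesteps collected per rollout.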
This implementation of PPO supports both single and multi-agent environments, and, as such, there are many design decisions to be made. Currently, ppo-and-friends follows the standards outlined below.
- All actions sent to the step function will be wrapped in a dictionary mapping agent ids to actions.
- Calling env.step(actions) will result in a tuple of the following form: (obs, critic_obs, reward, info, done), s.t. each tuple element is a dictionary mapping agent ids to the appropriate data.
- Death masking is used at all times, which means that all agents are expected to exist in the step results as long as an episode hasn't terminated.
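The short sketch below (not taken from the library) simply illustrates the dict-keyed interface described above; the environment and the actions dictionary are assumed to come from elsewhere.

```python
def print_step_results(env, actions):
    """
    Illustrative only: step a PPO-AF wrapped environment with a dictionary
    of actions and unpack the dict-keyed results described above.
    """
    obs, critic_obs, reward, info, done = env.step(actions)

    # Each element maps agent ids to that agent's data.
    for agent_id in obs:
        print(f"{agent_id}: reward={reward[agent_id]}, done={done[agent_id]}")

    return obs, critic_obs, reward, info, done
```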
Since not all environments will adhere to the above standards, various wrappers are provided in the environments directory. For best results, all environments should be wrapped in a class inheriting from PPOEnvironmentWrapper.
This is the default policy for single-agent environments.
arXiv:2103.01955v4 makes the distinction between MAPPO and IPPO such that the former uses a centralized critic receiving global information about the agents of a shared policy (usually a concatenation of the observations), and the latter uses an independent, decentralized critic.
Both options can be enabled by setting the critic_view parameter in the PPOEnvironmentWrapper appropriately. Options as of now are "global", "policy", and "local".
- global: this option will send observations from ALL agents in the environment, regardless of which policy they belong to, to every critic. Note that, when using a single policy, this is identical to MAPPO. However, when using multiple policies, each critic can see the observations of other policies.
- policy: this option will combine observations from all agents under shared policies, and the critics of those policies will receive the shared observations. This option is identical to MAPPO when using a single policy, and it allows for similar behavior when using multiple policies (multiple policies were not covered in the paper, but this general concept translates well).
- local: this option will send local observations from each agent to the critic of their respective policy. This is IPPO when using a single policy with multiple agents and PPO when using a single policy with one agent.
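As a rough sketch (the exact constructor signatures vary by wrapper, so treat the call below as an assumption), critic_view is simply passed to whichever PPO-AF wrapper matches your environment:

```python
def wrap_for_mappo(env, wrapper_class):
    """
    Hedged sketch: wrap an environment so that each policy's critic sees the
    combined observations of the agents sharing that policy (the "policy"
    view). wrapper_class is whichever PPO-AF wrapper matches your environment
    type (e.g. MultiAgentGymWrapper or ParallelZooWrapper); passing the env
    as the first positional argument is an assumption here.
    """
    return wrapper_class(env, critic_view="policy")
```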
All multi-agent environment wrappers that inherit from PPOEnvironmentWrapper allow users to set critic_view, with the exception of MAT, which cannot decouple the critics' observations from the actors' observations.
The Multi-Agent Transformer (MAT) can be enabled by setting a policy's class to MATPolicy. Different policy classes can be used for different policies within the same game. For instance, you can have one team use MATPolicy and another team use PPOPolicy.
The implementation of MAT within PPO-AF follows the original publication as closely as possible. Some exceptions were made to account for differences between the publication and its associated source code, as well as differences in architecture between PPO-AF and the publication's source code.
Full details on MAT can be found at its official site: https://sites.google.com/view/multi-agent-transformer
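The snippet below is only a hedged illustration of mixing policy classes; the import paths and the dictionary name are assumptions, and the real wiring (how policy settings are handed to run_ppo) is shown in the baselines/ runners.

```python
# Hedged illustration, not the exact PPO-AF API: the import paths and the
# dictionary below are assumptions. The point is simply that each policy can
# be given its own class, e.g. one team on MAT and one on standard PPO.
from ppo_and_friends.policies.mat_policy import MATPolicy  # assumed path
from ppo_and_friends.policies.ppo_policy import PPOPolicy  # assumed path

# Hypothetical team names; agents mapped to "team_a" share a MAT policy,
# while agents mapped to "team_b" share a standard PPO policy.
policy_classes = {
    "team_a" : MATPolicy,
    "team_b" : PPOPolicy,
}
```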
Both single agent and multi-agent gymnasium games are supported through the SingleAgentGymWrapper and MultiAgentGymWrapper, respectively. For examples on how to train a gymnasium environment, check out the runners in baselines/gymnasium/.
IMPORTANT: While Gymnasium does not have a standard interface for multi-agent games, I've found some commonalities among many publications, and we are using this as our standard. You may need to make changes to your multi-agent gymnasium environments before they can be wrapped in the MultiAgentGymWrapper.
Our expectations of multi-agent Gymnasium environments are as follows:
- The step method must return observation, reward, terminated, truncated, info. observation, reward, terminated, and truncated must be iterables s.t. each index maps to a specific agent, and this order must not change. info must be a dict.
- The reset method must return the agent observations as an iterable with the same index constraints defined above.
- Both env.observation_space and env.action_space must be iterables such that indices map to agents in the same order they are given from the step and reset methods.
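For concreteness, here is a minimal toy environment (not part of the library) that follows the conventions listed above; the spaces, dynamics, and rewards are placeholders.

```python
# A minimal two-agent toy environment following the conventions above.
# Illustrative only; spaces, dynamics, and rewards are placeholders.
from gymnasium.spaces import Box, Discrete

class TwoAgentToyEnv:

    def __init__(self):
        # Index 0 maps to agent 0 and index 1 to agent 1, and this ordering
        # is identical for spaces, observations, rewards, terminated, and
        # truncated.
        self.observation_space = [Box(-1.0, 1.0, shape=(4,)) for _ in range(2)]
        self.action_space      = [Discrete(2) for _ in range(2)]
        self.steps             = 0

    def reset(self, *args, **kwargs):
        self.steps = 0
        # One observation per agent, in agent order.
        return [space.sample() for space in self.observation_space]

    def step(self, actions):
        self.steps += 1
        obs        = [space.sample() for space in self.observation_space]
        reward     = [float(a) for a in actions]
        terminated = [self.steps >= 100 for _ in range(2)]
        truncated  = [False for _ in range(2)]
        info       = {}
        return obs, reward, terminated, truncated, info
```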
For environments that only exist in versions <= 0.21 of Gym, you can use the Gym21ToGymnasium wrapper. See baselines/gym/ for examples.
IMPORTANT: While Gym does not have a standard interface for multi-agent games, I've found some commonalities among many publications, and we are using this as our standard. You may need to make changes to your multi-agent Gym environments before they can be wrapped in the MultiAgentGymWrapper.
Our expectations of multi-agent Gym environments are as follows:
- The step method must return observation, reward, done, info. observation, reward, and done must be iterables s.t. each index maps to a specific agent, and this order must not change. info must be a dict.
- The reset method must return the agent observations as an iterable with the same index constraints defined above.
- Both env.observation_space and env.action_space must be iterables such that indices map to agents in the same order they are given from the step and reset methods.
Games that exist in Gym versions >= 0.26 but not Gymnasium can be tricky. I've found that the biggest issue is the spaces not matching up. We have a function gym_space_to_gymnasium_space that can be used to (attempt to) convert spaces from Gym to Gymnasium.
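A hedged usage sketch follows; only the function name gym_space_to_gymnasium_space comes from this document, while its import path and one-argument signature are assumptions.

```python
# Hedged sketch: the import path and the one-argument signature are
# assumptions; only the function name comes from the documentation above.
import gym.spaces

from ppo_and_friends.utils.misc import gym_space_to_gymnasium_space  # assumed path

old_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
new_space = gym_space_to_gymnasium_space(old_space)
```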
The AbmarlWrapper can be used for Abmarl environments. See baselines/abmarl for examples.
The ParallelZooWrapper can be used for PettingZoo environments. See baselines/pettingzoo for examples.
All environments must be wrapped in the PPOEnvironmentWrapper. If you're using a custom environment that doesn't conform to supported standards, you can create your own wrapper that inherits from PPOEnvironmentWrapper.
PPO-AF was created by Alister Maguire, maguire7@llnl.gov.
PPO-AF is open source, and contributing is easy.
- Create a branch with your changes.
- Make sure that your changes work. Add tests if appropriate.
- Open a pull request and add a reviewer.
The code of this site is released under the MIT License. For more details, see the LICENSE file.
LLNL-CODE-867112