This repository deals with the practical application of the MADDPG multi-agent reinforcement learning model as part of the seminar "Advanced Topics in Reinforcement Learning" by the Neural Information Processing Group @ TU Berlin.
I apply different configurations of the DDPG (Lillicrap et al.) and MADDPG (Lowe et al.) models to a traffic light grid environment. The papers for these two models can be found here (MADDPG) and here (DDPG).
All experiments rely heavily on the traffic simulation project Flow (see flow-project.github.io) and the reinforcement learning library RLlib (see https://ray.readthedocs.io/en/latest/rllib-env.html).
I forked the flow repository and started to define my experiments (see examples/rl/multiagent/) using the preexisting MultiTrafficLightGridPOEnv environment. Here is a gif showing that environment type:
In all experiments, the environment consists of a 3x3 grid of intersections with one lane in each direction; cars constantly enter from the borders of the simulation and traverse the grid left-right / up-down (or vice versa). The traffic grid looks something like this:
At each of the nine crossings, one traffic light agent switches the light phase, allowing either left-right or up-down traffic at each simulation step. The traffic lights are either toggled according to a fixed strategy (see baseline measurement) or by a reinforcement learning agent.
The agents receive local observations of the speed and distance of the nearest cars as well as information about their neighbouring traffic lights (see flow/envs/multiagent/traffic_light_grid), making the setting partially observable. In each timestep of the simulation, they receive a reward signal based on the average delay over all vehicles (see flow/core/rewards.py) plus a penalty for standing vehicles (see flow/envs/multiagent/traffic_light_grid.py).
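Purely for illustration, here is a minimal sketch of how a delay-based term and a standstill penalty might be combined into one scalar reward; the function name, threshold, and weighting below are assumptions, not Flow's actual implementation:

```python
import numpy as np

def sketch_reward(vehicle_speeds, max_speed, standstill_threshold=0.1, penalty_weight=1.0):
    """Illustrative reward: negative average delay minus a penalty for standing vehicles.

    All names and weights here are assumptions; the real computation lives in
    flow/core/rewards.py and flow/envs/multiagent/traffic_light_grid.py.
    """
    speeds = np.asarray(vehicle_speeds, dtype=float)
    if speeds.size == 0:
        return 0.0
    # average delay: how far vehicles are from the free-flow speed (0 = no delay)
    avg_delay = np.mean((max_speed - speeds) / max_speed)
    # fraction of vehicles that are (almost) standing still
    standing_fraction = np.mean(speeds < standstill_threshold)
    return -avg_delay - penalty_weight * standing_fraction
```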
Each agent's action space consists of a single continuous action value in the range [0, 1]. If the value is less than or equal to 0.5, the environment does not change the light phase at the given intersection; if it is above 0.5, it initiates a change of the traffic light phase.
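To make the thresholding concrete, here is a minimal sketch of how such an action dictionary could be interpreted; `switch_phase` and `keep_phase` are hypothetical callbacks, the real logic sits inside MultiTrafficLightGridPOEnv:

```python
def apply_rl_actions_sketch(rl_actions, switch_phase, keep_phase):
    """Interpret one continuous action in [0, 1] per traffic light.

    `switch_phase` and `keep_phase` are hypothetical helpers standing in for the
    environment internals in flow/envs/multiagent/traffic_light_grid.py.
    """
    for agent_id, action in rl_actions.items():
        value = float(action[0]) if hasattr(action, "__len__") else float(action)
        if value > 0.5:
            switch_phase(agent_id)   # initiate a phase change at this intersection
        else:
            keep_phase(agent_id)     # leave the current phase unchanged
```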
As documented below, I have run a baseline measurement and two runs with a single agent switching all lights, and am currently running multi-agent experiments on the hardware provided by the Neural Information Processing Group (thanks to Vaios).
# install Miniconda first: https://docs.conda.io/en/latest/miniconda.html
conda create --name maddpg python=3.6 -y
conda activate maddpg
# install flow (this repository) in development mode
python3.6 setup.py develop
# install the SUMO binaries (macOS); on Ubuntu 18.04 use the second script instead
scripts/setup_sumo_osx.sh
#scripts/setup_sumo_ubuntu1804.sh
export SUMO_HOME="$HOME/sumo_binaries/bin"
export PATH="$SUMO_HOME:$PATH"
# verify that sumo is installed
sumo --version
# replace the released ray package with a 0.9.0 dev wheel (macOS, Python 3.6)
pip uninstall ray -y
pip install https://ray-wheels.s3-us-west-2.amazonaws.com/master/2d97650b1e01c299eda8d973c3b7792b3ac85307/ray-0.9.0.dev0-cp36-cp36m-macosx_10_13_intel.whl
- Non-RL actuated lights baseline - Can be run by:
python simulate.py traffic_light_grid_edit
See rendered simulation and simulation metrics.
- PPO / SingleAgentEnv / Single policy - No artefacts stored.
python train_ppo.py singleagnet_traffic_light_grid
- TD3 (Successor of DDPG) / MultiAgentEnv / Single (shared) policy - Converged to a policy where all traffic lights toggle synchronously.
python train_td3.py multiagent_ddpg
See rendered simulation and simulation metrics.
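For reference, a single shared policy over all traffic-light agents can be expressed in RLlib's multi-agent API roughly as follows; the environment name, observation size, and spaces below are placeholders, not the exact experiment configuration:

```python
import numpy as np
from gym.spaces import Box

# placeholder spaces: a fixed-size local observation and one action in [0, 1]
obs_space = Box(low=-np.inf, high=np.inf, shape=(42,), dtype=np.float32)
act_space = Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

config = {
    "env": "multiagent_traffic_light_grid",  # assumed registered env name
    "multiagent": {
        # a single policy entry ...
        "policies": {"shared_policy": (None, obs_space, act_space, {})},
        # ... that every agent id maps to, so all traffic lights share one set of weights
        "policy_mapping_fn": lambda agent_id: "shared_policy",
    },
}
# launched e.g. via ray.tune.run("TD3", config=config)
```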
- DDPG / MultiAgentEnv / Multiple Policies / Shared Critic - Training terminated due to a leaking experience buffer.
python train_ddpg.py multiagent_ddpg_multi
See rendered simulation and simulation metrics. When observing the simulation, we can see that this setup (local actors with a shared critic) leads to individual actions per intersection, which suggests that the setup picks up local optimisation signals while learning.
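In contrast to the shared-policy setup above, the multi-policy runs give every intersection its own actor. A rough sketch of such a policy mapping (the agent-id convention and spaces are again placeholders, not the actual env ids):

```python
import numpy as np
from gym.spaces import Box

obs_space = Box(low=-np.inf, high=np.inf, shape=(42,), dtype=np.float32)
act_space = Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

# one independent policy per traffic light (nine intersections)
policies = {"tl_{}".format(i): (None, obs_space, act_space, {}) for i in range(9)}

multiagent_config = {
    "policies": policies,
    # assumed agent-id convention "center0" ... "center8"; adjust to the env's actual ids
    "policy_mapping_fn": lambda agent_id: "tl_" + agent_id.replace("center", ""),
}
```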
- MADDPG / MultiAgentEnv / Multiple Policies / Shared Critic - No learning occurs.
python train_maddpg.py multiagent_maddpg
Apparently there is a bug in the implementation. I talked to the code maintainer, and together we figured out that, first, exploration noise is missing and, second, there is some weirdness in the action-space output. Since the exploration-noise implementation from DDPG was being refactored into a general-purpose noise API within Ray at the time, I decided to wait and try the earliest dev release that enables MADDPG to use that API.
- DDPG / MultiAgentEnv / Multiple Policies / Single Critic - Training terminated due to worker failure.
python train_ddpg_local_critic.py multiagent_ddpg_multi
- MADDPG / MultiAgentEnv / Multiple Policies / Shared Critic - Using adapted code.
python train_maddpg_noise.py multiagent_maddpg
- Updated Ray to 0.9.0dev
- Integrated Ornstein-Uhlenbeck exploration noise.
- Changed the actor network output from a OneHotCategorical distribution over the batch to a sigmoid layer, as in the DDPG implementation.
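For illustration, a minimal Ornstein-Uhlenbeck noise process of the kind used for DDPG-style exploration looks roughly like this (a stand-alone sketch with commonly used default parameters, not Ray's exploration API):

```python
import numpy as np

class OrnsteinUhlenbeckNoiseSketch:
    """Temporally correlated exploration noise, as commonly added to DDPG actions.

    Illustrative stand-alone implementation, not Ray's noise API.
    """

    def __init__(self, size, theta=0.15, sigma=0.2, dt=1.0, mu=0.0):
        self.size, self.theta, self.sigma, self.dt, self.mu = size, theta, sigma, dt, mu
        self.state = np.full(size, mu, dtype=float)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(self.size))
        self.state += dx
        return self.state.copy()


# usage: perturb the deterministic actor output, then clip back into [0, 1]
noise = OrnsteinUhlenbeckNoiseSketch(size=1)
action = np.clip(0.7 + noise.sample(), 0.0, 1.0)
```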
- MADDPG / MultiAgentEnv / Multiple Policies / Shared Critic / Hyperparameter Opt
- Runs MADDPG with a hyperparameter search and a custom maddpg_policy.py.
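The search itself is set up with Ray Tune. Purely as a sketch, a grid over typical MADDPG-style hyperparameters could look like this; the parameter names and ranges are illustrative, not the exact search space used in the experiment:

```python
from ray import tune

# illustrative search space; merged into the base MADDPG training config
hyperparam_search = {
    "actor_lr": tune.grid_search([1e-4, 1e-3]),
    "critic_lr": tune.grid_search([1e-4, 1e-3]),
    "tau": tune.grid_search([0.001, 0.01]),   # target-network update rate
    "gamma": tune.grid_search([0.95, 0.99]),  # discount factor
}
```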
All training results, model artefacts, learning curves, etc. for the performed experiments are located in examples/results.
- A SUMO simulation with the trained policy / policies can be started with:
python3.6 ../flow/visualize/visualizer_rllib.py data/trained_ring 200 --horizon 2000 --gen_emission
- Symlinking SUMO's libproj dependency (macOS workaround):
ln -s /usr/local/opt/proj/lib/libproj.19.dylib /usr/local/opt/proj/lib/libproj.15.dylib
Flow is a computational framework for deep RL and control experiments for traffic microsimulation.
See our website for more information on the application of Flow to several mixed-autonomy traffic scenarios. Other results and videos are available as well.
If you have a bug, please report it. Otherwise, join the Flow Users group on Slack! You'll receive an email shortly after filling out the form.
We welcome your contributions.
- Please report bugs and suggest improvements by submitting a GitHub issue.
- Submit your contributions using pull requests. Please use this template for your pull requests.
If you use Flow for academic research, you are highly encouraged to cite our paper:
C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, A. Bayen, "Flow: Architecture and Benchmarking for Reinforcement Learning in Traffic Control," CoRR, vol. abs/1710.05465, 2017. [Online]. Available: https://arxiv.org/abs/1710.05465
If you use the benchmarks, you are highly encouraged to cite our paper:
E. Vinitsky, A. Kreidieh, L. Le Flem, N. Kheterpal, K. Jang, F. Wu, ... and A. M. Bayen, "Benchmarks for Reinforcement Learning in Mixed-Autonomy Traffic," in Conference on Robot Learning, pp. 399-409. [Online]. Available: http://proceedings.mlr.press/v87/vinitsky18a.html
Flow is supported by the Mobile Sensing Lab at UC Berkeley and Amazon AWS Machine Learning research grants. The contributors are listed in the Flow Team Page.