ReLAx - Reinforcement Learning Applications
ReLAx is an object-oriented library for deep reinforcement learning built on top of PyTorch.
- Implemented Algorithms
- Special Features
- Usage With Custom Environments
- Minimal Examples
- Installation
- Further Developments
- Known Issues
The ReLAx library contains implementations of the following algorithms:
- Value Based (Model-Free):
- Model Based:
- Hybrid MB-MF
ReLAx offers a set of special features:
- Simple interface for lagging environment observations: Recurrent Policies for Handling Partially Observable Environments
- Sampling from parallel environments: Speeding Up PPO with Parallel Sampling
- Wide possibilities for scheduling hyper-parameters: Scheduling TRPO's KL Divergence Constraint (a short sketch follows this list)
- Support of N-step bootstrapping for all off-policy value-based algorithms: Multistep TD3 for Locomotion
- Support of Prioritized Experience Replay for all off-policy value-based algorithms: Prioritised DDQN for Atari-2600
- Simple interface for model-based acceleration: DYNA Model-Based Acceleration with TD3 / MBPO with SAC
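As a small illustration of the hyper-parameter scheduling feature, the sketch below mirrors the schedule construction used in the DQN minimal example further down. The reading of PiecewiseSchedule's arguments (hold the dictionary value for the given number of steps, then switch to the trailing value) is an assumption inferred from that example, not a definitive reference.

import torch
from relax.rl.critics import DQN
from relax.schedules import PiecewiseSchedule
from relax.zoo.critics import DiscQMLP

# Assumed reading: keep the learning rate at 0 for the first 5000 steps
# (no learning while experience is collected), then switch to 5e-5
lr_schedule = PiecewiseSchedule({0: 5000}, 5e-5)

# The schedule is passed where a constant learning rate would otherwise go
critic = DQN(
    device=torch.device('cpu'),
    critic_net=DiscQMLP(obs_dim=4, acs_dim=2, nlayers=2, nunits=64),
    learning_rate=lr_schedule,
    batch_size=100,
    target_updates_freq=3000
)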
Beyond these, ReLAx offers further options for building non-standard RL architectures.
Some examples of how to write custom user-defined environments and use them with ReLAx:
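As a minimal, hedged sketch of the idea (the ToyCorridorEnv class below is hypothetical and not part of ReLAx), a custom environment only needs to implement the standard gym.Env interface; it can then be wrapped into Sampler exactly like a registered gym environment:

import numpy as np
import gym
from gym import spaces
from relax.data.sampling import Sampler

class ToyCorridorEnv(gym.Env):
    # Hypothetical 1-D corridor: move left or right, reward for reaching the right end
    def __init__(self, length=10, max_steps=200):
        super().__init__()
        self.length = length
        self.max_steps = max_steps
        self.action_space = spaces.Discrete(2)  # 0: left, 1: right
        self.observation_space = spaces.Box(low=0.0, high=float(length),
                                            shape=(1,), dtype=np.float32)
        self.pos = 0
        self.t = 0

    def reset(self):
        self.pos = 0
        self.t = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        self.t += 1
        self.pos += 1 if action == 1 else -1
        self.pos = int(np.clip(self.pos, 0, self.length))
        done = (self.pos == self.length) or (self.t >= self.max_steps)
        reward = 1.0 if self.pos == self.length else 0.0
        return np.array([self.pos], dtype=np.float32), reward, done, {}

# Wrap the custom environment into Sampler, just like a built-in gym env
sampler = Sampler(ToyCorridorEnv())

The minimal examples below then show complete training loops on the standard CartPole-v1 environment: first an on-policy VPG agent, then an off-policy DQN agent.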
import torch
import gym
from relax.rl.actors import VPG
from relax.zoo.policies import CategoricalMLP
from relax.data.sampling import Sampler
# Create training and eval envs
env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")
# Wrap them into Sampler
sampler = Sampler(env)
eval_sampler = Sampler(eval_env)
# Define Vanilla Policy Gradient actor
actor = VPG(
    device=torch.device('cuda'),  # torch.device('cpu') if no gpu available
    policy_net=CategoricalMLP(acs_dim=2, obs_dim=4,
                              nlayers=2, nunits=64),
    learning_rate=0.01
)
# Run training loop:
for i in range(100):
    # Sample training data
    train_batch = sampler.sample(n_transitions=1000,
                                 actor=actor,
                                 train_sampling=True)
    # Update VPG actor
    actor.update(train_batch)
    # Collect evaluation episodes
    eval_batch = eval_sampler.sample_n_episodes(n_episodes=5,
                                                actor=actor,
                                                train_sampling=False)
    # Print average return per iteration
    print(f"Iter: {i}, eval score: {eval_batch.create_logs()['avg_return']}")
import torch
import gym
from relax.rl.actors import ArgmaxQValue
from relax.rl.critics import DQN
from relax.exploration import EpsilonGreedy
from relax.schedules import PiecewiseSchedule
from relax.zoo.critics import DiscQMLP
from relax.data.sampling import Sampler
from relax.data.replay_buffer import ReplayBuffer
# Create training and eval envs
env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")
# Wrap them into Sampler
sampler = Sampler(env)
eval_sampler = Sampler(eval_env)
# Define schedules
# For the first 5k steps: no learning, only random sampling
lr_schedule = PiecewiseSchedule({0: 5000}, 5e-5)
eps_schedule = PiecewiseSchedule({1: 5000}, 1e-3)
# Define actor
actor = ArgmaxQValue(
    exploration=EpsilonGreedy(eps=eps_schedule)
)
# Define critic
critic = DQN(
    device=torch.device('cuda'),  # torch.device('cpu') if no gpu available
    critic_net=DiscQMLP(obs_dim=4, acs_dim=2,
                        nlayers=2, nunits=64),
    learning_rate=lr_schedule,
    batch_size=100,
    target_updates_freq=3000
)
# Provide actor with critic
actor.set_critic(critic)
# Run q-iteration training loop:
print_every = 1000
replay_buffer = ReplayBuffer(100000)
for i in range(100000):
    # Sample training data (one transition)
    train_batch = sampler.sample(n_transitions=1,
                                 actor=actor,
                                 train_sampling=True)
    # Add it to buffer
    replay_buffer.add_paths(train_batch)
    # Update DQN critic
    critic.update(replay_buffer)
    # Update ArgmaxQValue actor (only to step schedules)
    actor.update()
    if i > 0 and i % print_every == 0:
        # Collect evaluation episodes
        eval_batch = eval_sampler.sample_n_episodes(n_episodes=5,
                                                    actor=actor,
                                                    train_sampling=False)
        # Print average return per iteration
        print(f"Iter: {i}, eval score: " +
              f"{eval_batch.create_logs()['avg_return']}, " +
              "buffer score: " +
              f"{replay_buffer.create_logs()['avg_return']}")
Installing into a separate virtual environment:
git clone https://github.com/nslyubaykin/relax
cd relax
conda create -n relax python=3.6
conda activate relax
pip install -r requirements.txt
pip install -e .
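To sanity-check the installation, importing the package from the activated relax environment should succeed:

import relax  # should import without errors after pip install -e .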
To install MuJoCo, do the following steps:
mkdir ~/.mujoco
cd ~/.mujoco
wget http://www.roboti.us/download/mujoco200_linux.zip
unzip mujoco200_linux.zip
mv mujoco200_linux mujoco200
rm mujoco200_linux.zip
wget http://www.roboti.us/file/mjkey.txt
Then, add the following line to the bottom of your .bashrc:
export LD_LIBRARY_PATH=~/.mujoco/mujoco200/bin/
Finally, install mujoco_py itself:
pip install mujoco-py==2.0.2.2
Note: the installation often crashes with the error "error: command 'gcc' failed with exit status 1".
To fix this, run:
sudo apt-get install gcc
sudo apt-get install build-essential
Then try installing mujoco-py==2.0.2.2 again.
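To verify that mujoco-py built correctly, importing it should now succeed (the first import compiles the bindings and may take a few minutes):

import mujoco_py  # the first import triggers a one-time build of the bindings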
The ReLAx package was developed and tested with gym[atari]==0.17.2. Newer versions should also work; however, their compatibility with the provided Atari wrappers is uncertain.
Here is the Gym Atari installation guide:
pip install gym[atari]==0.17.2
In case of a "ROMs not found" error, do the following steps:
- Download ROMs archive
wget http://www.atarimania.com/roms/Roms.rar
- Unpack it
unrar x Roms.rar
- Install atari_py
pip install atari_py
- Provide atari_py with ROMS
python -m atari_py.import_roms ROMS
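A quick way to check that the ROMs were picked up is to create one of the Atari environments (PongNoFrameskip-v4 is just one example id available in gym[atari]==0.17.2):

import gym
env = gym.make("PongNoFrameskip-v4")  # should no longer raise a "ROMs not found" error
print(env.observation_space, env.action_space)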
In the future, the following functionality is planned to be added:
- Curiosity (RND)
- Offline RL (CQL, BEAR, BCQ, SAC-N, EDAC)
- Decision Transformers
- PPG
- QR-DQN
- IQN
- FQF
- Discrete SAC
- NAF
- Stochastic environment models
- Improving documentation
The following issues are currently known:
- Lack of documentation (for now, compensated with usage examples)
- On some systems, relax.zoo.layers.NoisyLinear seems to leak memory. This issue is unpredictable and not yet fully understood. Sometimes installing different versions of PyTorch and CUDA fixes it; if the problem persists, consider not using noisy linear layers as a workaround.
- Filtering & Reward Weighted Refinement does not yet reach the performance declared in its paper
- DYNA-Q is not compatible with PER, as it is not clear which priority to assign to synthetic branched transitions (a possible option: the same priority as the parent transition)