Understanding RL
Environment: PongNoFrameskip-v4

Goals of the repo:

  • Demystify RL algorithms by providing minimal, object-oriented PyTorch implementations with accompanying pseudocode and explanations
    • I also provide quick explanations of tricky PyTorch manipulations, like .squeeze() or .detach(), at the end of this README
  • Support my theory notes
  • Practice implementing algorithms

Features:

  • Accompanying theory notes 📕
  • Minimal, object-oriented code for both simple (Semi-Gradient Sarsa) and state-of-the-art (PPO-Clip) algorithms
  • Understandable and intuitive logging via experiment tracking
  • Easy reward and loss plotting
  • Hyperparameter tuning (in < 40 lines of code for all algorithms)
  • Intuitive terminal interface (in < 50 lines)

Disclaimer:

  • These implementations aren't meant for research use, but for full-transparency learning.
    As such, no testing capabilities or pre-trained models are provided.

Algorithms

| Algorithm | Lines of Code | Verified Environments |
| --- | --- | --- |
| Semi-Gradient Sarsa | < 100 | CartPole-v1 |
| Reinforce with Baseline | < 100 | CartPole-v1 |
| Deep Deterministic Policy Gradient (DDPG) | ~ 150 | HalfCheetah-v2, Pendulum-v1 |
| N-Step Actor Critic | < 150 | CartPole-v1, LunarLander-v2 |
| Double Deep Q Network (DDQN) | < 200 | CartPole-v1, LunarLander-v2, PongNoFrameskip-v4 |
| Proximal Policy Optimization (PPO) | < 200 | CartPole-v1, LunarLander-v2, Pendulum-v1 |
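
For a flavour of what "minimal" means here, below is a rough sketch of the PPO-Clip surrogate loss in PyTorch. It is illustrative only and not taken from this repo's code; the actual implementation may differ in details.

```python
# Rough sketch of the PPO-Clip surrogate loss (illustrative only; the repo's
# actual PPO implementation may differ in details).
import torch


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective: -E[min(r * A, clip(r, 1-eps, 1+eps) * A)]."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


# Tiny usage example with dummy tensors
new_lp = torch.randn(8, requires_grad=True)
old_lp = new_lp.detach() + 0.1 * torch.randn(8)
advantages = torch.randn(8)
ppo_clip_loss(new_lp, old_lp, advantages).backward()
```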

Usage

For a complete description run

python run.py -h

Train

python run.py --algo <algo> --env <env>

Outputs

<algorithm>
  └─logs
    └─<environment> 
      └─<experiment_id>
          ├─config.txt >> Contains agent configuration 
          ├─log.txt >> Stdout output (useful for customization)
          └─results.csv >> CSV of Rewards and Loss

For example running

python run.py --algo ddqn --env CartPole-v1

will yield

ddqn
  └─logs
    └─CartPole-v1 
      └─experiment_0
          ├─config.txt >> Contains agent configuration 
          ├─log.txt >> Stdout output (useful for customization)
          └─results.csv >> CSV of Rewards and Loss
  • Running again does not overwrite previous results; new experiments are appended to the <algo>/logs/<env> folder (see the sketch after this list)
  • Files are written as soon as the experiment starts, so interrupting with CTRL+C will still yield plottable results
  • You can also delete the last experiment by re-running the same command with the -d or --delete flag
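
The append-only behaviour amounts to picking the next free experiment_<n> folder. Below is a hypothetical illustration of that convention, not this repo's actual code; the helper name is made up.

```python
# Hypothetical illustration of the experiment-numbering convention described
# above (not this repo's actual code): each run creates the next free
# experiment_<n> folder under <algo>/logs/<env> instead of overwriting.
from pathlib import Path


def next_experiment_dir(algo: str, env: str) -> Path:
    base = Path(algo) / "logs" / env
    existing = [
        int(p.name.split("_")[-1])
        for p in base.glob("experiment_*")
        if p.name.split("_")[-1].isdigit()
    ]
    exp_dir = base / f"experiment_{max(existing, default=-1) + 1}"
    exp_dir.mkdir(parents=True)
    return exp_dir


print(next_experiment_dir("ddqn", "CartPole-v1"))  # e.g. ddqn/logs/CartPole-v1/experiment_1
```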

Tune / Optimize

This is mostly helpful if you plan on adding new environments. It uses Optuna to run several hyperparameter combinations and picks the best one.

python run.py --algo ddpg --env Pendulum-v1 --optimize --n-trials 100

Outputs

<algorithm>
  └─logs
    └─<environment>
      └─<experiment>
        ├─trial_1
        │   ├─config.txt >> Contains agent configuration
        │   ├─log.txt >> Stdout output (useful for customization)
        │   └─results.csv >> CSV of Rewards and Loss
        ├─trial_2
        │   └─...
        ├─...
        └─trial_<n_trials>
            └─...
  • You can cancel it with CTRL+C
  • For simplicity, all hyperparameter suggestions are made in the core.optuna_create method. Tweaking them is left to you (see the sketch below).
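
As a rough illustration of what an Optuna-based tuning loop typically looks like, here is a minimal sketch. The hyperparameter names and the dummy training function are assumptions made for illustration; the actual suggestions live in core.optuna_create and may differ.

```python
# Minimal Optuna sketch (illustrative only; hyperparameter names and the
# training function are assumptions, not this repo's actual API).
import optuna


def train_and_evaluate(config: dict) -> float:
    """Placeholder for a real training loop; returns a dummy score so the
    sketch runs end-to-end. In practice this would train the agent and
    return its mean episode reward."""
    return -abs(config["lr"] - 1e-3) - abs(config["gamma"] - 0.99)


def objective(trial: optuna.Trial) -> float:
    # Sample one candidate hyperparameter configuration per trial
    config = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "gamma": trial.suggest_float("gamma", 0.9, 0.999),
        "hidden_size": trial.suggest_categorical("hidden_size", [64, 128, 256]),
    }
    return train_and_evaluate(config)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```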

Plot

Used for plotting losses and rewards

python run.py --algo <algo> --env <env> --plot <experiment>

Ex: python run.py --algo ppo --env CartPole-v1 --plot experiment_6

Would open

Example Plotting of PPO results in CartPole-v1

Example Plotting of DDQN results in Atari Pong

![ddqn_pong_results.png](ddqn_pong_results.png)
  • <experiment> can be omitted, in which case the latest experiment for the specified algorithm and environment is used (a sketch for plotting results.csv manually follows below).
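
If you want to plot results yourself, results.csv can be read with pandas. A minimal sketch, assuming the CSV exposes reward and loss columns (the exact column names may differ; check the header of your own results.csv):

```python
# Minimal sketch for plotting a results.csv manually (the column names
# "reward" and "loss" are assumptions; check your own CSV header).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ddqn/logs/CartPole-v1/experiment_0/results.csv")

fig, (ax_reward, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
ax_reward.plot(df["reward"])
ax_reward.set_title("Reward per episode")
ax_loss.plot(df["loss"])
ax_loss.set_title("Loss")
plt.tight_layout()
plt.show()
```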

TODO

  • put pseudocode image into every folder
  • run every algorithm and put some graphs
  • Add at least 2 environments per main algorithm
  • unify all train.py
  • consistent signature across all algorithms
  • Add Atari onto DQN
  • add logging
    • To csv
  • explain folder structure
  • experiment manager
  • add no test disclaimer
  • separate plotting from main execution (use intermediary csv)
  • vectorized environments. (Check stable_baselines3/common/vec_env)
  • unit tests to verify all algos work (take a look at rl-baselines3-zoo/tests)
  • Allow for video saving
  • Allow for model saving

other improvements

  • Create Environment class with run_episode code

troubleshooting

  • In case you run into ROM license troubles when running PongNoFrameskip-v4, run
pip install "gym[atari,accept-rom-license]"

Be aware that this accepts the ROM license for you.

Pytorch tricks

  • .squeeze(...)
    • Removes dimensions of size 1 from a tensor (all of them, or just the given one). Useful when you're working with slightly different tensor shapes. Ex: a tensor of shape [32, 1].squeeze() -> [32]
  • .unsqueeze(...)
    • The reverse of .squeeze(): it inserts a dimension of size 1 at the given position. Ex: a tensor of shape [32].unsqueeze(1) -> [32, 1]
  • .view(...)
    • More general reshaping: it can change several dimensions at once, as long as the total number of elements stays the same. Ex: a tensor of shape [2, 2].view(4, 1) -> [4, 1]
  • .detach()
    • Detaches a tensor from gradient computation. Useful when you want to make predictions without backpropagating through them (Ex: the DDQN target network; see the snippet below)
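
The snippet below walks through these four operations end to end; it is a self-contained illustration, not code from this repo:

```python
# Quick tour of the four operations above (illustrative sketch, not repo code).
import torch

x = torch.randn(32, 1)                  # batch of 32 values with a trailing singleton dim
print(x.squeeze().shape)                # torch.Size([32])    -- size-1 dim removed
print(x.squeeze().unsqueeze(1).shape)   # torch.Size([32, 1]) -- size-1 dim added back

y = torch.arange(4.0).view(2, 2)
print(y.view(4, 1).shape)               # torch.Size([4, 1])  -- general reshape

# .detach(): compute targets without tracking gradients (as in a DDQN target network)
net = torch.nn.Linear(4, 2)
target_net = torch.nn.Linear(4, 2)
states = torch.randn(32, 4)
q_values = net(states)
targets = target_net(states).detach()   # no gradients flow into target_net
loss = torch.nn.functional.mse_loss(q_values, targets)
loss.backward()                         # only net's parameters receive gradients
```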
