Poster and Code for the project in Reinforcement Learning course of the MSc in Artificial Intelligence at the University of Amsterdam. Joint project of Gabriele Bani, Andrii Skliar, Gabriele Cesa and Davide Belli
Using a single human demonstration has been shown to outperform humans and beat state-of-the-art models on hard exploration problems [Learning Montezuma's Revenge from a Single Demonstration].
However, it takes an experienced professional to provide a good demonstration to the model, which might be impossible in real-world problems, and even then obtaining optimal demonstrations can be difficult. Can we still learn optimal policies from sub-optimal demonstrations?
Basic idea: divide the demonstration trajectory into n splits. Train starting from the last split until convergence, then move to the previous split. Repeat until the first split is reached, so as to learn from increasingly difficult exploration problems. A sketch of this loop is shown below.
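A minimal sketch of this backward curriculum, under the assumption that the environment can be reset to arbitrary demonstration states via a hypothetical `set_state` method (standard Gym environments do not expose this), and with `train_until_convergence` and `evaluate_return` standing in for whatever RL algorithm and evaluation routine are actually used:

```python
def backward_curriculum(env, agent, demo_states, n_splits, target_return,
                        train_until_convergence, evaluate_return):
    """Train on increasingly long suffixes of the demonstration trajectory."""
    split_size = max(1, len(demo_states) // n_splits)
    # Start indices of each split, ordered from the last split back to the first.
    start_indices = list(range(0, len(demo_states), split_size))[::-1]

    for start in start_indices:
        start_state = demo_states[start]

        def reset_to_split():
            # Reset the environment to the beginning of the current split.
            env.reset()
            env.set_state(start_state)   # hypothetical API, see note above
            return start_state

        # Train from this starting point until the agent solves the remaining
        # (shorter, easier) part of the task, then move the starting point one
        # split earlier in the demonstration.
        train_until_convergence(agent, env, reset_to_split, target_return)
        print(f"split starting at step {start}: "
              f"return {evaluate_return(agent, env, reset_to_split):.2f}")
    return agent
```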
Figure: Returns over episodes in Maze (left), MountainCar (middle) and LunarLander (right).
- Non-optimal demonstrations can lead to optimal results, but better demonstrations lead to better learning and give more reliable results.
- In Maze, using bad demonstrations rather than suboptimal ones results in a better final policy because of a higher degree of exploration.
- With more complex environments, we expect demonstrations to allow for much faster training than training from scratch.
- The current implementation is very sensitive to hyperparameter choices; there is a need for a more automatic and reliable version of the backward algorithm to overcome this issue.
Copyright © 2018 Gabriele Bani.
This project is distributed under the MIT license. It was developed as part of the Reinforcement Learning course taught by Herke van Hoof at the University of Amsterdam. If you are a student, please follow the UvA regulations governing Fraud and Plagiarism.