Read the paper.pdf for a detailed overview of the project.
This project reimplements key components of the Rainbow DQN architecture to train reinforcement learning agents for Atari games, specifically focusing on training efficiency using limited resources (Apple Silicon M3 Pro with 18 GB unified memory). Atari 2600 games are used as benchmarks to test the performance of reinforcement learning algorithms with high-dimensional inputs (raw pixels). The goal is to demonstrate that meaningful DRL research can be achieved without massive computational resources.
- Noisy Networks: Introduce parameterized noise into the network weights to improve exploration (a minimal sketch follows this list).
- N-Step Learning: Uses multi-step returns to provide a richer learning signal, improving long-term planning.
- Double DQN: Reduces overestimation bias by separating action selection and evaluation.
- Dueling Networks: Splits the estimation of state values and action advantages, enhancing learning efficiency.
- Distributional DQN: Models the distribution of rewards, allowing the agent to capture uncertainty in returns.
- Prioritized Experience Replay: Samples significant experiences more frequently, speeding up learning.
- We did not implement A2C (for resource and simplicity reasons), even though it is included in the Rainbow DQN paper.
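As an illustration of the first component above, here is a minimal, hedged sketch of a noisy linear layer with factorized Gaussian noise in the spirit of Fortunato et al. (2017). It is not the project's exact implementation; the layer name and the sigma initialization constant are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights and biases carry learnable, factorized Gaussian noise."""
    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1 / math.sqrt(in_features)               # initialization as in the Noisy Nets paper
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma_init / math.sqrt(in_features))
        self.bias_sigma.data.fill_(sigma_init / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _scaled_noise(size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()                 # f(x) = sgn(x) * sqrt(|x|)

    def reset_noise(self):
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(eps_out.outer(eps_in))     # factorized outer-product noise
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        if self.training:                                # noisy weights drive exploration
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:                                            # evaluation uses the mean weights
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)
```

Replacing the fully connected layers of the Q-network with such layers removes the need for an epsilon-greedy schedule, which is why the epsilon hyperparameters are marked N/A for Rainbow DQN in the table further below.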
The repository is organized as follows:
- data/: Stores training results, including model checkpoints and logs, organized by target update frequency (10K_TUF and 32K_TUF).
- experiments/: JSON configuration files for the different experiments, organized by training setup (10K TUF, 32K TUF, and transfer learning).
- imgs/: Images and graphs for visualizations, such as training curves and game screenshots.
- notebooks/: Jupyter notebook (plots.ipynb) for generating training performance graphs and analysis.
- onnx/: Models saved in ONNX format for compatibility and exporting (generated on Breakout).
- requirements.txt: Python dependencies required for the project.
- scripts/: Core scripts for training and testing.
This structure is designed to keep training, testing, and analysis organized, with clear separation between configuration files, model checkpoints, and utility functions.
The project explores training models for two games: Breakout and Assault, comparing Double DQN and Rainbow DQN across two different target update frequencies (32K and 10K). The Rainbow DQN outperforms Double DQN, especially in the more complex Assault game.
To assess the performance of our trained agents, we conducted a series of tests under standardized conditions. Each agent was evaluated over 500 episodes, following an initial warm-up phase of 100 episodes. Empirically, this warm-up improved the mean scores by allowing agents to stabilize their performance before formal evaluation. A minimal sketch of the evaluation loop follows the list below.
- Warm-Up Games: 100 episodes.
- Evaluation Games: 500 episodes with 5 lives per game.
- Preprocessing: Identical to the training phase, including frame stacking and environment wrappers.
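The sketch below illustrates this protocol, assuming a Gymnasium-style environment and an agent exposing an act() method; both are placeholders, not the project's exact API.

```python
# Hedged sketch of the evaluation protocol: 100 warm-up episodes are played
# but discarded, then 500 episodes are scored. `env` and `agent.act` are
# illustrative placeholders rather than the project's actual interfaces.
import numpy as np

def evaluate(env, agent, warmup_episodes=100, eval_episodes=500):
    scores = []
    for episode in range(warmup_episodes + eval_episodes):
        obs, _ = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(obs)  # greedy (or noisy-net) action selection
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        if episode >= warmup_episodes:  # only score the post-warm-up games
            scores.append(total_reward)
    return np.mean(scores), np.std(scores), np.min(scores), np.max(scores)
```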
- Results on Breakout with a 32K Target Update Frequency (TUF); TrRainbow DQN denotes the transfer-learning variant of Rainbow DQN:
Agent | Mean Score | Standard Deviation | Minimum Score | Maximum Score |
---|---|---|---|---|
Random | 1.364 | 1.394 | 0.0 | 7.0 |
Human | 31.8 | - | - | - |
Double DQN | 2.028 | 2.560 | 0.0 | 9.0 |
Rainbow DQN | 60.512 | 33.293 | 8.0 | 353.0 |
TrRainbow DQN | 71.668 | 37.784 | 18.0 | 314.0 |
- Results on Breakout with a 10K Target Update Frequency (TUF):
Agent | Mean Score | Standard Deviation | Minimum Score | Maximum Score |
---|---|---|---|---|
Random | 1.364 | 1.394 | 0.0 | 7.0 |
Human | 31.8 | - | - | - |
Double DQN | 163.254 | 78.810 | 30.0 | 385.0 |
Rainbow DQN | 58.594 | 22.192 | 23.0 | 257.0 |
- Results on Assault with a 32K Target Update Frequency (TUF):
Agent | Mean Score | Standard Deviation | Minimum Score | Maximum Score |
---|---|---|---|---|
Random | 242.382 | 74.342 | 63.0 | 504.0 |
Human | 1496.0 | - | - | - |
Double DQN | 1673.888 | 544.315 | 651.0 | 4158.0 |
Rainbow DQN | 4883.154 | 2096.328 | 871.0 | 9792.0 |
TrRainbow DQN | 3104.678 | 963.984 | 849.0 | 7363.0 |
- Results on Assault with a 10K Target Update Frequency (TUF):
Agent | Mean Score | Standard Deviation | Minimum Score | Maximum Score |
---|---|---|---|---|
Random | 242.382 | 74.342 | 63.0 | 504.0 |
Human | 1496.0 | - | - | - |
Double DQN | 2144.93 | 640.198 | 744.0 | 5098.0 |
Rainbow DQN | 4750.356 | 1852.385 | 776.0 | 9073.0 |
To enhance learning efficiency and agent performance, we implemented several environment wrappers using the Gymnasium framework (a short sketch of a few of them follows this list):
- No-Op Reset: Randomizes the initial game state by executing a random number (1-30) of NOOP actions at the start of each episode, promoting robustness.
- Fire-On-Reset: Automatically issues a specific action (e.g., FIRE in Breakout) upon reset or after a life is lost, ensuring immediate interaction with the environment.
- Preprocessing of Gym Environment:
- Resizing: Downscaled frames to 84×84 pixels.
- Grayscale Conversion: Converted RGB images to grayscale.
- Normalization: Scaled pixel values from the range [0, 255] to [0.0, 1.0].
- Frame Skipping: Repeats the agent's selected action for a fixed number of frames (typically 4), reducing computational load and focusing on longer-term strategies.
- Frame Stacking: Stacks the last four frames along the channel dimension (4, 84, 84) to provide temporal context, using lazy frame stacking to optimize memory usage.
- Enhanced Statistics Access: Exposes additional game statistics (e.g., lives remaining) to improve the agent's ability to detect significant events and assign accurate rewards.
- Terminal Signal on Life Loss: Treats life loss as a terminal state (not a reset), providing immediate feedback to the agent about the consequences of its actions.
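Below is a minimal sketch of a few of these wrappers (No-Op Reset, Fire-On-Reset, and the lazy frame container), assuming a standard Gymnasium Atari environment in which action 0 is NOOP and action 1 is FIRE; it is illustrative rather than the project's exact code.

```python
import numpy as np
import gymnasium as gym

class NoopResetWrapper(gym.Wrapper):
    """Execute 1-30 random NOOPs after reset to randomize the initial game state."""
    def __init__(self, env, noop_max=30):
        super().__init__(env)
        self.noop_max = noop_max

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        for _ in range(np.random.randint(1, self.noop_max + 1)):
            obs, _, terminated, truncated, info = self.env.step(0)  # action 0 = NOOP
            if terminated or truncated:
                obs, info = self.env.reset(**kwargs)
        return obs, info

class FireResetWrapper(gym.Wrapper):
    """Press FIRE once after reset so games like Breakout actually start."""
    def reset(self, **kwargs):
        self.env.reset(**kwargs)
        obs, _, terminated, truncated, info = self.env.step(1)  # action 1 = FIRE
        if terminated or truncated:
            obs, info = self.env.reset(**kwargs)
        return obs, info

class LazyFrames:
    """Keeps references to the last four frames and stacks them only when needed,
    so the replay buffer does not hold four copies of every 84x84 frame."""
    def __init__(self, frames):
        self._frames = list(frames)

    def __array__(self, dtype=None):
        out = np.stack(self._frames, axis=0)  # materialize (4, 84, 84) on demand
        return out.astype(dtype) if dtype is not None else out

# Example composition (assumes gymnasium[atari] / ale_py is installed):
# env = FireResetWrapper(NoopResetWrapper(gym.make("ALE/Breakout-v5")))
```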
We carefully selected hyperparameters to optimize the performance of both the Double DQN and Rainbow DQN implementations; they are inspired by the original papers. A small illustration of the distributional settings follows the table.
Hyperparameter | Double DQN | Rainbow DQN |
---|---|---|
Learning Rate | 0.00025 | 0.00025 |
Discount Factor (γ) | 0.99 | 0.99 |
Replay Memory Size | 300,000 | 300,000 |
Mini-Batch Size | 32 | 32 |
Target Update Frequency | 32,000 or 10,000 | 32,000 or 10,000 |
Frame Skip | 4 | 4 |
Min Epsilon | 0.1 | N/A |
Max Epsilon | 1.0 | N/A |
Epsilon Decay Phase | 0.1 | N/A |
Max Steps | 5,000,000 | 5,000,000 |
Replay Start Size | 80,000 | 80,000 |
Save Frequency | 500,000 | 500,000 |
Vmin | N/A | -10 |
Vmax | N/A | 10 |
Number of Atoms (Natoms) | N/A | 51 |
Δz (Atom Gap) | N/A | (Vmax - Vmin)/(Natoms - 1) |
N-Step | N/A | 3 |
Alpha (α) | N/A | 0.5 |
Betastart | N/A | 0.4 |
Betaframes | N/A | Max Steps - Replay Start Size |
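As an illustration of the distributional settings in the table (Vmin, Vmax, Natoms, and Δz), the sketch below builds the fixed support of return values used by the categorical value distribution; the variable names are illustrative.

```python
import torch

# Support of the categorical return distribution, from the hyperparameters above.
v_min, v_max, n_atoms = -10.0, 10.0, 51
delta_z = (v_max - v_min) / (n_atoms - 1)        # atom gap: 20 / 50 = 0.4
support = torch.linspace(v_min, v_max, n_atoms)  # fixed return values z_i

# A distributional head outputs, per action, a probability vector over these
# 51 atoms; the scalar Q-value is the expectation under that distribution,
# i.e. Q(s, a) = (probs * support).sum(dim=-1).
```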
Training Graphs:
- Assault: Rainbow DQN performed significantly better than Double DQN, showing faster convergence and higher rewards.
- Breakout: Double DQN initially outperformed Rainbow DQN, but Rainbow DQN showed potential for superior performance with longer training.
Using pre-trained weights from one game, transfer learning was applied to the other game. This resulted in faster convergence and better performance, demonstrating the value of shared feature representations in similar environments. We used the 32K TUF models and hyperparameters for transfer learning and trained for 5M steps; a hedged sketch of the weight-transfer step follows.
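The sketch assumes the two games share the convolutional trunk while their output heads differ because of the action counts (Breakout has 4 actions, Assault has 7); TinyQNet is a toy stand-in rather than the project's actual network.

```python
import torch
import torch.nn as nn

class TinyQNet(nn.Module):
    """Toy stand-in: shared convolutional trunk plus a game-specific output head."""
    def __init__(self, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(64 * 9 * 9, n_actions)  # 9x9 feature map for 84x84 inputs

    def forward(self, x):
        return self.head(self.trunk(x))

source_net = TinyQNet(n_actions=4)  # pretend this was trained on Breakout
target_net = TinyQNet(n_actions=7)  # Assault has a larger action space

# Copy only the parameters whose names and shapes match (the shared trunk);
# the mismatched head keeps its fresh initialization and is fine-tuned on the new game.
src, dst = source_net.state_dict(), target_net.state_dict()
dst.update({k: v for k, v in src.items() if k in dst and v.shape == dst[k].shape})
target_net.load_state_dict(dst)
```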
Install the required packages with:
pip install -r requirements.txt
To train models, use:
python scripts/train.py --config experiments/<config_file>.json
To test trained models:
python scripts/test.py --config experiments/<config_file>.json --checkpoint_folder <path_to_checkpoint> --model_type {double,rainbow} --env_num <env_number>
- Importance of Hyperparameter Tuning: The impact of the Target Update Frequency (TUF) is game-dependent. In Assault, a higher TUF (32K) led to more stable learning and better long-term performance, while in Breakout, the simpler Double DQN benefited from a lower TUF (10K) for faster convergence. This highlights the need to adjust hyperparameters based on the game's complexity.
- Rainbow DQN's Advantage in Complex Games: Rainbow DQN's architecture, combining multiple improvements (e.g., noisy networks, multi-step learning, distributional Q-learning), showed its strength in more complex games like Assault, where it outperformed Double DQN by a large margin. The integrated improvements are particularly useful in environments requiring deeper exploration and learning from long-term rewards.
- Overfitting and Underfitting Dynamics: Lower TUF values (10K) helped achieve faster initial performance but led to overfitting in the long run, especially in more complex environments like Assault. In contrast, a higher TUF (32K) prevented overfitting, offering better generalization in extended training sessions.
- Transfer Learning Effectiveness: Transfer learning from one game to another allowed the models to leverage shared visual patterns and dynamics, accelerating the learning process. This demonstrates the potential of building generalized agents capable of adapting to different games or tasks by transferring learned knowledge.
- Frame Stacking and Lazy Frames: Using four-frame stacking enabled the agent to capture motion information in dynamic environments efficiently, and the lazy-frame implementation optimized memory usage. This was crucial in achieving faster training without compromising performance, especially given limited computational resources.
- Longer Training Duration: Extending training beyond 5 million steps could further explore the potential of Rainbow DQN, particularly in less complex games like Breakout, where the architecture benefits from prolonged training.
- Game-Specific Tuning: Fine-tuning hyperparameters like TUF and learning rate for each game can further enhance performance, especially as complexity varies significantly across Atari games.
- Exploring More Components: Conducting ablation studies to evaluate the individual contributions of Rainbow DQN's components (e.g., noisy networks, dueling networks) could provide insights into which improvements drive performance gains in specific environments.
On a MacBook Pro with an M3 Pro chip (MPS device in PyTorch) and 18 GB of unified memory, using the default hyperparameters and once the replay memory is filled, we observed the following performance:
- Training Double DQN: ~400 steps/s (about 3.5 hours for 5M steps)
- Training Rainbow DQN: ~200 steps/s (about 7 hours for 5M steps)
Training can be sped up on a CUDA device, which is not available on Apple Silicon. On Kaggle, training was faster as long as the environments were not parallelized; parallelizing the environments requires a powerful CPU.
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. Nature
- Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). arXiv
- Fortunato, M., Azar, M. G., Piot, B., Menick, J., Blundell, C., Legg, S., & Wierstra, D. (2017). Noisy Networks for Exploration. arXiv
- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized Experience Replay. arXiv
- Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of AAAI Conference on Artificial Intelligence (pp. 2094–2100).
- Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. arXiv
- Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. arXiv
- Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44. DOI
- Chen, Y., Liu, Z., Yan, J., Li, H., Jin, O., & Yang, Q. (2020). Pre-training Tasks for Embedding-based Large Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3755-3765).