Read the paper.pdf for a detailed overview of the project.
This project reimplements key components of the Rainbow DQN architecture to train reinforcement learning agents for Atari games, specifically focusing on training efficiency using limited resources (Apple Silicon M3 Pro with 18 GB unified memory). Atari 2600 games are used as benchmarks to test the performance of reinforcement learning algorithms with high-dimensional inputs (raw pixels). The goal is to demonstrate that meaningful DRL research can be achieved without massive computational resources.
- Noisy Networks: Introduce parameterized noise into the network weights to improve exploration (a minimal sketch follows this list).
- N-Step Learning: Uses multi-step returns to provide a richer learning signal, improving long-term planning.
- Double DQN: Reduces overestimation bias by separating action selection and evaluation.
- Dueling Networks: Splits the estimation of state values and action advantages, enhancing learning efficiency.
- Distributional DQN: Models the distribution of rewards, allowing the agent to capture uncertainty in returns.
- Prioritized Experience Replay: Samples significant experiences more frequently, speeding up learning.
- We did not implement A2C (for resource and simplicity reasons), even though it is included in the Rainbow DQN paper.
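As an illustration of the first component above, here is a minimal, hedged sketch of a noisy linear layer with factorized Gaussian noise in the spirit of Fortunato et al. (2017). It is not the project's exact implementation; the layer name and the sigma initialization constant are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights and biases carry learnable, factorized Gaussian noise."""
    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        bound = 1 / math.sqrt(in_features)               # initialization as in the Noisy Nets paper
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma_init / math.sqrt(in_features))
        self.bias_sigma.data.fill_(sigma_init / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _scaled_noise(size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()                 # f(x) = sgn(x) * sqrt(|x|)

    def reset_noise(self):
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(eps_out.outer(eps_in))     # factorized outer-product noise
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        if self.training:                                # noisy weights drive exploration
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:                                            # evaluation uses the mean weights
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)
```

Replacing the fully connected layers of the Q-network with such layers removes the need for an epsilon-greedy schedule, which is why the epsilon hyperparameters are marked N/A for Rainbow DQN in the table further below.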
The repository is organized as follows:
- data/: Stores training results, including model checkpoints and logs, organized by target update frequency (10K_TUF and 32K_TUF).
- experiments/: JSON configuration files for the different experiments, organized by training setup (10K TUF, 32K TUF, and transfer learning).
- imgs/: Images and graphs for visualizations, such as training curves and game screenshots.
- notebooks/: Jupyter notebook (plots.ipynb) for generating training performance graphs and analysis.
- onnx/: Models saved in ONNX format for compatibility and exporting (generated on Breakout).
- requirements.txt: Python dependencies required for the project.
- scripts/: Core scripts for training and testing.
This structure is designed to keep training, testing, and analysis organized, with clear separation between configuration files, model checkpoints, and utility functions.
The project explores training models for two games: Breakout and Assault, comparing Double DQN and Rainbow DQN across two different target update frequencies (32K and 10K). The Rainbow DQN outperforms Double DQN, especially in the more complex Assault game.
To assess the performance of our trained agents, we conducted a series of tests under standardized conditions. Each agent was evaluated over 500 episodes, following an initial warm-up phase of 100 episodes. Empirically, this warm-up improved the mean scores by allowing agents to stabilize their performance before formal evaluation. A minimal sketch of the evaluation loop follows the list below.
- Warm-Up Games: 100 episodes.
- Evaluation Games: 500 episodes with 5 lives per game.
- Preprocessing: Identical to the training phase, including frame stacking and environment wrappers.
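The sketch below illustrates this protocol, assuming a Gymnasium-style environment and an agent exposing an act() method; both are placeholders, not the project's exact API.

```python
# Hedged sketch of the evaluation protocol: 100 warm-up episodes are played
# but discarded, then 500 episodes are scored. `env` and `agent.act` are
# illustrative placeholders rather than the project's actual interfaces.
import numpy as np

def evaluate(env, agent, warmup_episodes=100, eval_episodes=500):
    scores = []
    for episode in range(warmup_episodes + eval_episodes):
        obs, _ = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(obs)  # greedy (or noisy-net) action selection
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        if episode >= warmup_episodes:  # only score the post-warm-up games
            scores.append(total_reward)
    return np.mean(scores), np.std(scores), np.min(scores), np.max(scores)
```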
- Results on Breakout with a 32K Target Update Frequency (TUF); TrRainbow DQN denotes the transfer-learning variant of Rainbow DQN:
Agent | Mean Score | Standard Deviation | Minimum Score | Maximum Score |
---|---|---|---|---|
Random | 1.364 | 1.394 | 0.0 | 7.0 |
Human | 31.8 | - | - | - |
Double DQN | 2.028 | 2.560 | 0.0 | 9.0 |
Rainbow DQN | 60.512 | 33.293 | 8.0 | 353.0 |
TrRainbow DQN | 71.668 | 37.784 | 18.0 | 314.0 |
- Results on Breakout with a 10K Target Update Frequency (TUF):
Agent | Mean Score | Standard Deviation | Minimum Score | Maximum Score |
---|---|---|---|---|
Random | 1.364 | 1.394 | 0.0 | 7.0 |
Human | 31.8 | - | - | - |
Double DQN | 163.254 | 78.810 | 30.0 | 385.0 |
Rainbow DQN | 58.594 | 22.192 | 23.0 | 257.0 |
- Results on Assault with a 32K Target Update Frequency (TUF):
Agent | Mean Score | Standard Deviation | Minimum Score | Maximum Score |
---|---|---|---|---|
Random | 242.382 | 74.342 | 63.0 | 504.0 |
Human | 1496.0 | - | - | - |
Double DQN | 1673.888 | 544.315 | 651.0 | 4158.0 |
Rainbow DQN | 4883.154 | 2096.328 | 871.0 | 9792.0 |
TrRainbow DQN | 3104.678 | 963.984 | 849.0 | 7363.0 |
- Results on Assault with a 10K Target Update Frequency (TUF):
Agent | Mean Score | Standard Deviation | Minimum Score | Maximum Score |
---|---|---|---|---|
Random | 242.382 | 74.342 | 63.0 | 504.0 |
Human | 1496.0 | - | - | - |
Double DQN | 2144.93 | 640.198 | 744.0 | 5098.0 |
Rainbow DQN | 4750.356 | 1852.385 | 776.0 | 9073.0 |
To enhance learning efficiency and agent performance, we implemented several environment wrappers using the Gymnasium framework (a short sketch of a few of them follows this list):
- No-Op Reset: Randomizes the initial game state by executing a random number (1-30) of NOOP actions at the start of each episode, promoting robustness.
- Fire-On-Reset: Automatically issues a specific action (e.g., FIRE in Breakout) upon reset or after a life is lost, ensuring immediate interaction with the environment.
- Preprocessing of Gym Environment:
- Resizing: Downscaled frames to 84×84 pixels.
- Grayscale Conversion: Converted RGB images to grayscale.
- Normalization: Scaled pixel values from the range [0, 255] to [0.0, 1.0].
- Frame Skipping: Repeats the agent's selected action for a fixed number of frames (typically 4), reducing computational load and focusing on longer-term strategies.
- Frame Stacking: Stacks the last four frames along the channel dimension (4, 84, 84) to provide temporal context, using lazy frame stacking to optimize memory usage.
- Enhanced Statistics Access: Exposes additional game statistics (e.g., lives remaining) to improve the agent's ability to detect significant events and assign accurate rewards.
- Terminal Signal on Life Loss: Treats life loss as a terminal state (not a reset), providing immediate feedback to the agent about the consequences of its actions.
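Below is a minimal sketch of a few of these wrappers (No-Op Reset, Fire-On-Reset, and the lazy frame container), assuming a standard Gymnasium Atari environment in which action 0 is NOOP and action 1 is FIRE; it is illustrative rather than the project's exact code.

```python
import numpy as np
import gymnasium as gym

class NoopResetWrapper(gym.Wrapper):
    """Execute 1-30 random NOOPs after reset to randomize the initial game state."""
    def __init__(self, env, noop_max=30):
        super().__init__(env)
        self.noop_max = noop_max

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        for _ in range(np.random.randint(1, self.noop_max + 1)):
            obs, _, terminated, truncated, info = self.env.step(0)  # action 0 = NOOP
            if terminated or truncated:
                obs, info = self.env.reset(**kwargs)
        return obs, info

class FireResetWrapper(gym.Wrapper):
    """Press FIRE once after reset so games like Breakout actually start."""
    def reset(self, **kwargs):
        self.env.reset(**kwargs)
        obs, _, terminated, truncated, info = self.env.step(1)  # action 1 = FIRE
        if terminated or truncated:
            obs, info = self.env.reset(**kwargs)
        return obs, info

class LazyFrames:
    """Keeps references to the last four frames and stacks them only when needed,
    so the replay buffer does not hold four copies of every 84x84 frame."""
    def __init__(self, frames):
        self._frames = list(frames)

    def __array__(self, dtype=None):
        out = np.stack(self._frames, axis=0)  # materialize (4, 84, 84) on demand
        return out.astype(dtype) if dtype is not None else out

# Example composition (assumes gymnasium[atari] / ale_py is installed):
# env = FireResetWrapper(NoopResetWrapper(gym.make("ALE/Breakout-v5")))
```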
We carefully selected hyperparameters to optimize the performance of both the Double DQN and Rainbow DQN implementations; they are inspired by the original papers. A small illustration of the distributional settings follows the table.
Hyperparameter | Double DQN | Rainbow DQN |
---|---|---|
Learning Rate | 0.00025 | 0.00025 |
Discount Factor (γ) | 0.99 | 0.99 |
Replay Memory Size | 300,000 | 300,000 |
Mini-Batch Size | 32 | 32 |
Target Update Frequency | 32,000 or 10,000 | 32,000 or 10,000 |
Frame Skip | 4 | 4 |
Min Epsilon | 0.1 | N/A |
Max Epsilon | 1.0 | N/A |
Epsilon Decay Phase | 0.1 | N/A |
Max Steps | 5,000,000 | 5,000,000 |
Replay Start Size | 80,000 | 80,000 |
Save Frequency | 500,000 | 500,000 |
Vmin | N/A | -10 |
Vmax | N/A | 10 |
Number of Atoms (Natoms) | N/A | 51 |
Δz (Atom Gap) | N/A | (Vmax - Vmin)/(Natoms - 1) |
N-Step | N/A | 3 |
Alpha (α) | N/A | 0.5 |
Betastart | N/A | 0.4 |
Betaframes | N/A | Max Steps - Replay Start Size |
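As an illustration of the distributional settings in the table (Vmin, Vmax, Natoms, and Δz), the sketch below builds the fixed support of return values used by the categorical value distribution; the variable names are illustrative.

```python
import torch

# Support of the categorical return distribution, from the hyperparameters above.
v_min, v_max, n_atoms = -10.0, 10.0, 51
delta_z = (v_max - v_min) / (n_atoms - 1)        # atom gap: 20 / 50 = 0.4
support = torch.linspace(v_min, v_max, n_atoms)  # fixed return values z_i

# A distributional head outputs, per action, a probability vector over these
# 51 atoms; the scalar Q-value is the expectation under that distribution,
# i.e. Q(s, a) = (probs * support).sum(dim=-1).
```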
Training Graphs:
- Assault: Rainbow DQN performed significantly better than Double DQN, showing faster convergence and higher rewards.
- Breakout: Double DQN initially outperformed Rainbow DQN, but Rainbow DQN showed potential for superior performance with longer training.
Using pre-trained weights from one game, transfer learning was applied to the other game. This resulted in faster convergence and better performance, demonstrating the value of shared feature representations in similar environments. We used the 32K TUF models and hyperparameters for transfer learning and trained for 5M steps; a hedged sketch of the weight-transfer step follows.
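The sketch assumes the two games share the convolutional trunk while their output heads differ because of the action counts (Breakout has 4 actions, Assault has 7); TinyQNet is a toy stand-in rather than the project's actual network.

```python
import torch
import torch.nn as nn

class TinyQNet(nn.Module):
    """Toy stand-in: shared convolutional trunk plus a game-specific output head."""
    def __init__(self, n_actions):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(64 * 9 * 9, n_actions)  # 9x9 feature map for 84x84 inputs

    def forward(self, x):
        return self.head(self.trunk(x))

source_net = TinyQNet(n_actions=4)  # pretend this was trained on Breakout
target_net = TinyQNet(n_actions=7)  # Assault has a larger action space

# Copy only the parameters whose names and shapes match (the shared trunk);
# the mismatched head keeps its fresh initialization and is fine-tuned on the new game.
src, dst = source_net.state_dict(), target_net.state_dict()
dst.update({k: v for k, v in src.items() if k in dst and v.shape == dst[k].shape})
target_net.load_state_dict(dst)
```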
Install the required packages with:
pip install -r requirements.txt
To train models, use:
python scripts/train.py --config experiments/<config_file>.json
To test trained models:
python scripts/test.py --config experiments/<config_file>.json --checkpoint_folder <path_to_checkpoint> --model_type {double,rainbow} --env_num <env_number>
- Importance of Hyperparameter Tuning: The impact of the Target Update Frequency (TUF) is game-dependent. In Assault, a higher TUF (32K) led to more stable learning and better long-term performance, while in Breakout, the simpler Double DQN benefited from a lower TUF (10K) for faster convergence. This highlights the need to adjust hyperparameters based on the game's complexity.
- Rainbow DQN's Advantage in Complex Games: Rainbow DQN's architecture, combining multiple improvements (e.g., noisy networks, multi-step learning, distributional Q-learning), showed its strength in more complex games like Assault, where it outperformed Double DQN by a large margin. The integrated improvements are particularly useful in environments requiring deeper exploration and learning from long-term rewards.
- Overfitting and Underfitting Dynamics: Lower TUF values (10K) helped achieve faster initial performance but led to overfitting in the long run, especially in more complex environments like Assault. In contrast, a higher TUF (32K) prevented overfitting, offering better generalization in extended training sessions.
- Transfer Learning Effectiveness: Transfer learning from one game to another allowed the models to leverage shared visual patterns and dynamics, accelerating the learning process. This demonstrates the potential of building generalized agents capable of adapting to different games or tasks by transferring learned knowledge.
- Frame Stacking and Lazy Frames: Using four-frame stacking enabled the agent to capture motion information in dynamic environments efficiently, and the lazy-frame implementation optimized memory usage. This was crucial in achieving faster training without compromising performance, especially given limited computational resources.
- Longer Training Duration: Extending training beyond 5 million steps could further explore the potential of Rainbow DQN, particularly in less complex games like Breakout, where the architecture benefits from prolonged training.
- Game-Specific Tuning: Fine-tuning hyperparameters like TUF and learning rate for each game can further enhance performance, especially as complexity varies significantly across Atari games.
- Exploring More Components: Conducting ablation studies to evaluate the individual contributions of Rainbow DQN's components (e.g., noisy networks, dueling networks) could provide insights into which improvements drive performance gains in specific environments.
On a MacBook Pro with an M3 Pro chip (MPS device in PyTorch) and 18 GB of unified memory, using the default hyperparameters and once the replay memory is filled, we observed the following performance:
- Training Double DQN: ~400 steps/s (about 3.5 hours for 5M steps)
- Training Rainbow DQN: ~200 steps/s (about 7 hours for 5M steps)
Training can be sped up on a CUDA device, which is not available on Apple Silicon. On Kaggle, training was faster as long as the environments were not parallelized; parallelizing the environments requires a powerful CPU.
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. Nature
- Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). arXiv
- Fortunato, M., Azar, M. G., Piot, B., Menick, J., Blundell, C., Legg, S., & Wierstra, D. (2017). Noisy Networks for Exploration. arXiv
- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized Experience Replay. arXiv
- Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of AAAI Conference on Artificial Intelligence (pp. 2094–2100).
- Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. arXiv
- Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. arXiv
- Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44. DOI
- Chen, Y., Liu, Z., Yan, J., Li, H., Jin, O., & Yang, Q. (2020). Pre-training Tasks for Embedding-based Large Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3755-3765).