easy21

This repository implements the assignment requirements for the reinforcement learning course given by David Silver [1]. It implements a reinforcement learning environment and four different agents, namely monte-carlo, sarsa lambda, linear value function approximation, and neural network policy gradient, for simple card game called Easy21, presented in [2].

Setting up the environment

To setup the python environment you need Python v3.x, pip, and virtualenv:

git clone https://github.com/hartikainen/easy21.git
cd easy21

pip install virtualenv
virtualenv .venv
source .venv/bin/activate

pip install -r requirements.txt

Running the game

To run the game and test the agents, run the easy21.py file as follows:

python easy21.py [-h] [-v VERBOSE] [-a {mc,sarsa,lfa}]
                 [--num-episodes NUM_EPISODES] [--lmbd LMBD] [--gamma GAMMA]
                 [--plot-v PLOT_V] [--dump-q DUMP_Q]
                 [--plot-lambda-mse PLOT_LAMBDA_MSE]
                 [--plot-learning-curve PLOT_LEARNING_CURVE]

See easy21.py for more information about the running the game and testing the agents. All the agents are found in the /agents folder.

Easy21 Environment

The Easy21 environment is implemented Easy21Env class found in environment.py. The environment keeps track of the game state (dealer card and sum of player cards), and exposes a step method, which, given an action (hit or stick), updates its state, and returns the observed state (in our case observation is equivalent to the game state) and reward.

Monte-Carlo Control in Easy21

Monte-Carlo control for Easy21 is implemented in file agents/monte_carlo.py. The default implementation uses a time-varying scalar step-size α_t = 1/N(s_t, a_t) and ε-greedy exploration strategy with ε_t = N₀ / (N₀ + N(S_t)), where N₀ = 100, N(s) is the number of times that state s has been visited, N(s,a) is the number of times that action a has been selected from state s, and t is the time-step. The figure below presents the optimal value function V* against the game state (player sum and dealer hand).

TD Learning in Easy21

File agents/sarsa.py implements a Sarsa(λ) control for Easy21. It uses the same step-size and exploration schedules as the Monte-Carlo agent described in the previous section. The agent is tested with parameter values λ ∈ {0, 0.1, 0.2, ..., 1}, each ran for 20000 episodes. The first figure below present the learning curve, i.e. the mean-squared error vs. 'true' Q values against episode number, for each lambda. The next two figures plot the function V* (same as in Monte-Carlo section) for λ=0.0 and λ=1.0.

Bootstrapping in Easy21

As with any other situation, also in Easy21, bootstrapping reduces the variance of the learned policy, with the cost of increased bias. The Sarsa(λ) agent learns reasonable policy faster (i.e. in smaller number of episodes) than Monte-Carlo agent. I a game as simple as Easy21, however, it is feasible to run enough episodes for the Monte-Carlo agent to converge to the optimal unbiased policy.

The episodes in Easy21 on average last longer than in traditional Blackjack game because of the subtractive effect of red cards. Because of this, boostrapping is likely to be more useful in Easy21 than it would be in traditional Blackjack game.

Linear Function Approximation in Easy21

File agents/function_approximation.py implements a value function approximator with coarse coding for Easy21, using binary feature vector φ(state,action) with 36 (3*6*2) features. Each binary feature takes value 1 if (state, action) lies in the cuboid of state-space corresponding to that feature, and the action corresponding to that feature, and 0 otherwise. The cuboids are defined in the variable CUBOID_INTERVALS in agents/function_approximation.py.

Similarly to the Sarsa(λ) in the previous section, we run tests with 20000 episodes for parameter values λ ∈ {0, 0.1, 0.2, ..., 1}, with constant step-size α=0.01 and exploration value ε=0.05. The figures below plot the learning curve for each lambda, and the function V* for λ=0.0 and λ=1.0.

Notes for function approximation

Using approximation for the state and action space reduces the time and space complexity of the algorithm**, due to the reduced number of variables, corresponding to the states and actions, needed to learn by the agent. However, this comes with the cost of reduced accuracy of the learned state-value function Q (and thus value function V and policy π). It seems like the overlapping regions in the cuboid intervals result in more extreme values in some states. This happens because each state in the expanded Q function approximation can be affected by multiple states and actions through the weights of function approximation.

The effect of constant step-size α

One thing to notice is the effect of constant step-size used for the linear function approximation. Because the step-size is kept constant in the learning, some regions of the value function receive much less training than other parts. This results in incorrect value function in the extreme regions where the number of visits is small. This effect is tested with cuboid intervals corresponding to "identity" features, i.e. features that should result in exactly the same Q as learning with sarsa lambda. Even after 50k runs, the agent seems to have spots in the state space (low dealer and low player; low dealer and high player), where it doesn't match the sarsa results. This effect is presented in the figure below (for λ=0).

When using dynamic step-size α, as in the Sarsa(λ) section above, we can see that this effect disappears, and the function approximation results in expected approximation, as shown in the figure below (again, for λ=0).

** the actual running time is actually worse than with Sarsa(λ) because my function approximation implementation does not fully utilize numpy vectorization

Policy Gradient in Easy21

File agents/policy_gradient.py implements a simple two layer feed forward neural network and a policy gradient agent that utilizes the network. The network used by the agent uses 20 hidden neurons with ReLU non-linearities, and learning rate α=5*10^-4. The weights of the network are initialized from gaussian distribution with mean μ=0 and standard deviation σ=10^-4, and biases are initially set to 0. The figure below represents the running average (with 1000 episode window) of the reward received by the policy gradient agent over total of 100,000 episodes.

[1]http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html [2]http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/Easy21-Johannes.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

easy21

Setting up the environment

Running the game

Easy21 Environment

Monte-Carlo Control in Easy21

TD Learning in Easy21

Bootstrapping in Easy21

Linear Function Approximation in Easy21

Notes for function approximation

The effect of constant step-size α

Policy Gradient in Easy21

Files

README.md

Latest commit

History

README.md

File metadata and controls

easy21

Setting up the environment

Running the game

Easy21 Environment

Monte-Carlo Control in Easy21

TD Learning in Easy21

Bootstrapping in Easy21

Linear Function Approximation in Easy21

Notes for function approximation

The effect of constant step-size α

Policy Gradient in Easy21