Reinforcement learning applied to Easy21, an assignment from David Silver's Reinforcement Learning course at UCL. The assignment can be found here.
The agent played 1 million games (episodes) to obtain the following value function:
The optimal policy, obtained by selecting the action with the highest value in each state:
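As a minimal sketch of how a greedy policy and a value function can be read off a lookup table, the snippet below takes the argmax and max over the action axis. The array layout (shape `(10, 21, 2)`, indexed by dealer card, player sum, action) and the action encoding (0 = hit, 1 = stick) are assumptions made for illustration, not necessarily the layout used in this repository.

```python
import numpy as np

def greedy_policy_and_value(Q):
    """Derive a greedy policy and state values from a Q table.

    Q is assumed to have shape (10, 21, 2): dealer card 1..10,
    player sum 1..21, and two actions (0 = hit, 1 = stick).
    This layout is hypothetical.
    """
    policy = np.argmax(Q, axis=-1)  # best action index for each state
    value = np.max(Q, axis=-1)      # V(s) = max_a Q(s, a)
    return policy, value

# Toy table: all zeros except two hand-set entries.
Q = np.zeros((10, 21, 2))
Q[0, 20, 1] = 1.0  # dealer shows 1, player holds 21: sticking looks best
Q[0, 0, 0] = 0.5   # dealer shows 1, player holds 1: hitting looks best

policy, value = greedy_policy_and_value(Q)
```

Ties are broken in favour of the lower action index, which is simply how `np.argmax` behaves.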
The MSE of Q, the state-action value function, over the course of episodic learning. For each lambda, 10,000 episodes were measured against the Monte-Carlo state-action value function from 1 million episodes, saved in Q.dill:
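The comparison can be sketched as an element-wise squared difference between the learned table and the reference table. This is an illustration under assumptions: both tables are taken to be NumPy arrays of the same shape, the error is averaged over all state-action pairs (rather than summed), and the function name `mse` is hypothetical. In practice the reference table would be the one stored in Q.dill.

```python
import numpy as np

def mse(Q, Q_star):
    """Mean squared error between two Q tables of identical shape."""
    return float(np.mean((Q - Q_star) ** 2))

# Toy check: a constant offset of 0.5 gives an error of 0.25.
Q_star = np.zeros((10, 21, 2))   # stands in for the Monte-Carlo reference
Q = np.full((10, 21, 2), 0.5)    # stands in for a TD(lambda) estimate
error = mse(Q, Q_star)
```

Calling `mse` after every episode and storing the results would give one learning curve per lambda.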
Mean Squared Error after 1,000 episodes for different lambdas:
The optimal policy as derived from 10,000 episodes of TD(lambda = 0.3):
The lookup-table approach of the previous models is replaced by a coarse-coding function approximator. This reduces the 420 state-action combinations (10 dealer cards × 21 player sums × 2 actions) to 36 binary features.
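A sketch of one way to build such a 36-dimensional feature vector follows. It assumes the overlapping intervals from the Easy21 assignment text (3 dealer intervals × 6 player intervals × 2 actions); the action encoding (0 = hit, 1 = stick) and the names `features` and `q_hat` are hypothetical and may differ from this repository's code.

```python
import numpy as np

# Overlapping intervals, assumed to match the assignment text.
DEALER_INTERVALS = [(1, 4), (4, 7), (7, 10)]
PLAYER_INTERVALS = [(1, 6), (4, 9), (7, 12), (10, 15), (13, 18), (16, 21)]
ACTIONS = (0, 1)  # assumed encoding: 0 = hit, 1 = stick

def features(dealer, player, action):
    """Binary feature vector of length 3 * 6 * 2 = 36."""
    d = np.array([lo <= dealer <= hi for lo, hi in DEALER_INTERVALS], dtype=float)
    p = np.array([lo <= player <= hi for lo, hi in PLAYER_INTERVALS], dtype=float)
    a = np.array([action == act for act in ACTIONS], dtype=float)
    # A feature is active when the state falls inside its dealer interval
    # and its player interval and the action matches.
    phi = d[:, None, None] * p[None, :, None] * a[None, None, :]
    return phi.ravel()

def q_hat(dealer, player, action, theta):
    """Linear approximation: Q(s, a) is the dot product of features and weights."""
    return float(features(dealer, player, action) @ theta)

theta = np.zeros(36)             # weights to be learned
phi_example = features(4, 4, 0)  # state lying in overlapping intervals
```

Because the intervals overlap, a single state-action pair can activate more than one feature, which is what lets nearby states share what has been learned.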