Q_Learning_with_FrozenLake

Part 1: a 4x4, non-slippery FrozenLake grid environment solved with the Q-learning algorithm.

Click here for the Hugging Face link with a video of the implementation.


Q-Learning Algorithm

Q = "the Quality" of that action at that state. Q-learning is a model-free, value based, off-policy Reinforcement Learning algorithm which uses the Temporal Difference Control approach of updating the action-value function at each step instead of at the end of the episode, to learn the value of an action in a particular state.

Definition keypoints:

1. Model-free
2. Off-policy
3. Uses a TD approach
4. Value-based method
5. Control problem
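
As a concrete picture of the "value-based" and Q-table points above, here is a minimal sketch (assuming NumPy and the 4x4 FrozenLake layout of 16 states and 4 actions; the variable names are illustrative, not taken from this repo):

```python
import numpy as np

n_states = 16    # 4x4 FrozenLake grid -> 16 discrete states
n_actions = 4    # left, down, right, up

# Value-based: the Q-function is stored as a table with one row per state
# and one column per action.
q_table = np.zeros((n_states, n_actions))

# Q(s, a) is then just a table lookup.
state, action = 0, 2
print(q_table[state, action])   # 0.0 before any training
```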

Model-Free Algorithm

These algorithms learn the consequences of their actions through experience: the agent carries out an action many times, observes the outcomes, and adjusts its policy (the strategy behind its actions) to obtain the best rewards.

E.g., self-driving cars.

Model-Based Algorithm

In such an algorithm, the agent tries to understand its environment and builds a model of it from its interactions with that environment. In such a system, preferences take priority over the consequences of actions, i.e. a greedy agent will always try to perform the action that yields the maximum reward, irrespective of what that action may cause.

E.g., playing chess.

Off-Policy Algorithm

Such an algorithm uses a different policy for acting and for updating.

E.g., in the Q-learning algorithm:
Acting policy: the epsilon-greedy policy.
Updating policy: the greedy policy, used to select the best next-state action value when updating the Q-value.
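
To make the acting/updating split concrete, here is a minimal sketch (the names q_table, epsilon, and rng are illustrative assumptions, not taken from this repo): the acting policy is epsilon-greedy, while the update target is computed with the greedy policy over the next state's action values.

```python
import numpy as np

rng = np.random.default_rng(0)

def act_epsilon_greedy(q_table, state, epsilon, n_actions):
    """Acting policy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # exploratory (random) action
    return int(np.argmax(q_table[state]))     # exploiting (greedy) action

def td_target(q_table, reward, next_state, gamma):
    """Updating policy: greedy -- use the best next-state action value."""
    return reward + gamma * np.max(q_table[next_state])
```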

On-Policy Algorithm

Such an algorithm uses the same policy for acting and updating.

Q-Learning Algorithm Pseudocode

(Pseudocode diagram by Thomas Simonini)

Step 1: Initialize the Q-table.

Step 2: Choose an action using the epsilon-greedy strategy.

Step 3: Perform action At, receiving reward Rt+1 and next state St+1.

Step 4: Update Q(St, At).
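
Putting the four steps together, a minimal Q-learning training loop for the non-slippery 4x4 FrozenLake might look like the sketch below. The hyperparameter values and the Gymnasium API usage are assumptions for illustration, not taken from this repository.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

# Step 1: initialize the Q-table with zeros (one row per state, one column per action).
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma, epsilon = 0.7, 0.95, 1.0   # illustrative hyperparameters
rng = np.random.default_rng(0)

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        # Step 2: choose an action with the epsilon-greedy strategy.
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        # Step 3: perform the action, receive the reward and the next state.
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Step 4: update Q(St, At) towards the TD target.
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state

    # Decay exploration over time (one common choice, not necessarily the repo's).
    epsilon = max(0.05, epsilon * 0.999)
```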

Simply put, Q-learning:

- Trains a Q-function (an action-value function), which internally is a Q-table containing all the state-action pair values.
- Given a state and an action, the Q-function looks up the corresponding value in its Q-table.
- When training is done, an optimal Q-function is obtained, which means an optimal Q-table is obtained.
- Since there is an optimal Q-function, there is an optimal policy, because for each state the best action to take is now known.
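
The last point can be made concrete: the greedy policy is obtained by taking the per-state arg-max over the rows of the Q-table. A minimal sketch (assuming a NumPy q_table; zeros stand in here for a trained table):

```python
import numpy as np

# In practice this would be the trained table from the loop above;
# zeros stand in just to keep the sketch self-contained.
q_table = np.zeros((16, 4))

# One greedy action per state: the policy implied by the (optimal) Q-table.
policy = np.argmax(q_table, axis=1)

def act_greedy(state):
    return int(policy[state])
```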

Q-Learning Equation

Q(St, At) ← Q(St, At) + α * [ Rt+1 + γ * max_a Q(St+1, a) - Q(St, At) ]

where α is the learning rate and γ is the discount factor.
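
As a worked example with illustrative numbers (α = 0.1, γ = 0.99, a current value Q(St, At) = 0.2, reward Rt+1 = 1.0, and an assumed best next-state value of 0.5; none of these are the repository's settings):

```python
alpha, gamma = 0.1, 0.99                 # learning rate and discount factor (illustrative)
q_sa, reward, max_next_q = 0.2, 1.0, 0.5

td_target = reward + gamma * max_next_q  # 1.0 + 0.99 * 0.5 = 1.495
td_error = td_target - q_sa              # 1.495 - 0.2 = 1.295
new_q_sa = q_sa + alpha * td_error       # 0.2 + 0.1 * 1.295 = 0.3295
print(new_q_sa)
```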
