@patrickthompson
Created October 7, 2016 20:13
To replicate my evaluation result for Taxi-v0, do the following in order:
Import Libraries:
import the gym library
import the sys library
import the random library
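In Python, a minimal sketch of these imports looks like this (sys is imported in the gist but is not strictly needed for the sketches below):

```python
import gym     # OpenAI Gym environments, including Taxi
import sys     # imported in the gist; not used in these sketches
import random  # used for the epsilon-greedy exploration rolls
```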
Set constants:
Set the current episode count to zero
Set gamma to .15
Set starting epsilon to .5
Set the epsilon decay rate to .999
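As a sketch, with variable names of my own choosing:

```python
episode = 0            # current episode count
gamma = 0.15           # discount factor for the Q update
epsilon = 0.5          # starting exploration probability
epsilon_decay = 0.999  # multiplicative decay applied to epsilon
```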
Set global variables:
Set Q to an empty array
Set the environment variable to gym.make("Taxi-v1")
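A sketch of the globals; the gist mentions both Taxi-v0 and "Taxi-V1", and gym environment IDs are case-sensitive, so use whichever ID your gym version actually registers (I assume "Taxi-v1" here):

```python
Q = []                     # Q table, filled in below
env = gym.make("Taxi-v1")  # environment ID is an assumption; adjust to your gym version
```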
Populate Q array:
Append to the Q array one empty array for every available state in env.observation_space
Append to each of those arrays one 0 list item for every available action in env.action_space (each row should look like a [0,0,0,..] list)
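For the Taxi environment this gives one row of zeros per state, roughly:

```python
# One row per state, one zero per action, e.g. [0, 0, 0, 0, 0, 0] for Taxi.
for _ in range(env.observation_space.n):
    Q.append([0] * env.action_space.n)
```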
Execute the learning loop (a full Python sketch follows these steps):
While the episode count is less than 10,000, loop the following:
Set the [previous state] variable to zero
Set the [previous action] variable to zero
Set the [observation] variable to env.reset()
Set the [test] step count to zero
While the test count is less than 10,000, loop the following:
Set the next action:
If the maximum of all actions in the current state is 0, then select a random action from all available actions
Otherwise,
If a random number between 0 and 1 is less than epsilon (starting at .5), then select a random action
Otherwise, choose a "greedy" action:
Sort all of the action values for the current state in reverse order (largest first in the list)
Loop through all of the available actions
If the value of an action matches the first (largest) item in the sorted list, select that action id as our next action
Set the [previous state] to the current state
Set the [previous action] to the chosen action
Step the environment with the chosen action: set the new [observation], [reward], and [done] from env.step(action)
Update the Q matrix:
Set the value of the Q matrix for the old state and action (q[previous state][previous action]) equal to the current reward + gamma times the maximum of Q over the new state's row
If we are 'done':
Set epsilon to epsilon * epsilon_decay (effectively reducing the odds of a random roll) and end the episode (exit the inner loop)
Increment the test variable
Increment the episode variable
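Putting the loop together, a rough Python sketch of the steps above might look like the following. The env.step call, the break on 'done', and the variable names are my own additions where the write-up leaves them implicit:

```python
while episode < 10000:
    prev_state = 0
    prev_action = 0
    observation = env.reset()
    test = 0  # per-episode step counter (name assumed)
    while test < 10000:
        # Choose the next action.
        if max(Q[observation]) == 0:
            # Nothing learned for this state yet: pick a random action.
            action = random.randint(0, env.action_space.n - 1)
        elif random.random() < epsilon:
            # Exploration roll with probability epsilon (decays over time).
            action = random.randint(0, env.action_space.n - 1)
        else:
            # Greedy action: the id of the largest Q value for this state.
            best = sorted(Q[observation], reverse=True)[0]
            for a, value in enumerate(Q[observation]):
                if value == best:
                    action = a
                    break
        prev_state = observation
        prev_action = action
        # Step the environment with the chosen action (implicit in the write-up).
        observation, reward, done, info = env.step(action)
        # Q update: reward plus gamma times the best value from the new state.
        Q[prev_state][prev_action] = reward + gamma * max(Q[observation])
        if done:
            epsilon = epsilon * epsilon_decay  # fewer random rolls over time
            break  # assumed: end the episode once the environment reports done
        test += 1
    episode += 1
```

Note this is the literal update described above (no learning rate, the old value is overwritten in place), not a textbook Q-learning rule with a step size.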
Crack open a beer! Is anyone reading this? I doubt it. Prove me wrong: If you have read this, message me.
Patrick
guillaumeBellec commented Oct 12, 2016

I read your post.
Very simple Q-learning algorithm; I am happy it is working here.
I will try it out. Thanks
