Created October 7, 2016 20:13
To replicate my evaluation result for Taxi-v0, do the following in order:

Import libraries:
  Import the gym library
  Import the sys library
  Import the random library
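
In Python, those imports are one line each (sys is listed in the original write-up even though the steps below never use it):

    import gym     # OpenAI Gym, provides the Taxi environment
    import sys     # imported per the write-up, though unused in the steps below
    import random  # used for the epsilon-greedy action rolls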

Set constants:
  Set the current episode count to zero
  Set gamma to .15
  Set the starting epsilon to .5
  Set the epsilon decay rate to .999
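
A direct translation of those constants; the variable names here are my guesses, not confirmed from the original code:

    episode = 0            # current episode count
    gamma = 0.15           # discount factor for the Q update
    epsilon = 0.5          # starting chance of an exploratory (random) action
    epsilon_decay = 0.999  # multiplied into epsilon each time an episode finishes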

Set global variables:
  Set Q to a blank array
  Set the environment variable to gym.make("Taxi-v1")
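
In code (note the lowercase 'v': gym environment IDs are case-sensitive, so "Taxi-V1" would not resolve):

    Q = []                     # Q table, one row per state, filled in below
    env = gym.make("Taxi-v1")  # the Taxi environment as registered in gym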

Populate the Q array:
  Append to the Q array one blank array for every state in env.observation_space
  Append to each blank array one 0 list item for every available action in env.action_space (each row should look like a [0, 0, 0, ...] list)
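
A sketch of that initialization, assuming both spaces are Discrete (which they are for Taxi):

    # One row per state, one zero per action, e.g. [0, 0, 0, 0, 0, 0] for Taxi
    for state in range(env.observation_space.n):
        Q.append([0] * env.action_space.n)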

Execute the learning loop (a complete code sketch follows these steps):
  While the episode count is less than 10,000, loop the following:
    Set the [previous state] variable to zero
    Set the [previous action] variable to zero
    Set the [observation] variable to env.reset()
    While the test (step) count is less than 10,000, loop the following:
      Set the next action:
        If the maximum of all action values for the current state is 0 (i.e., the state is still unexplored), select a random action from all available actions
        Otherwise,
          If a random number between 0 and 1 is less than epsilon (starting at .5), select a random action
          Otherwise, choose a "greedy" action:
            Sort the available action values for the current state in reverse order (largest first)
            Loop through all of the available actions
              If the value of the action in the loop matches the first (largest) item in the sorted list, select that action id as the next action (this is effectively an argmax)
      Set the [previous state] to the current state
      Set the [previous action] to the chosen action
      Call env.step with the chosen action, capturing the new observation, the reward, and the 'done' flag (this step is implied by the reward and 'done' used below)
      Update the Q matrix:
        Set q[previous state][previous action] equal to the current reward plus gamma times the maximum Q-value in the row for the new observation
      If we are 'done':
        Set epsilon to epsilon * epsilon_decay (effectively reducing the odds of a random roll) and end the episode
      Increment the test variable
  Increment the episode variable
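
Putting the loop steps together, here is a sketch under the assumptions above. The break on 'done' and the per-episode reset of the test counter are my reading of the intent rather than explicit steps, and the variable names are mine:

    while episode < 10000:
        prev_state = 0
        prev_action = 0
        observation = env.reset()
        test = 0  # per-episode step counter; the write-up does not say where it resets
        while test < 10000:
            # Choose the next action
            if max(Q[observation]) == 0:
                # Unexplored state: every value is still zero, so pick at random
                action = random.randrange(env.action_space.n)
            elif random.random() < epsilon:
                # Exploration roll
                action = random.randrange(env.action_space.n)
            else:
                # Greedy: take the first action whose value matches the largest value
                best = sorted(Q[observation], reverse=True)[0]
                for a in range(env.action_space.n):
                    if Q[observation][a] == best:
                        action = a
                        break
            prev_state = observation
            prev_action = action
            # Take the action; old gym returns (observation, reward, done, info)
            observation, reward, done, info = env.step(action)
            # Q update: reward plus discounted best value of the new state's row
            Q[prev_state][prev_action] = reward + gamma * max(Q[observation])
            if done:
                # Reduce the odds of a random roll, then end the episode
                epsilon = epsilon * epsilon_decay
                break
            test += 1
        episode += 1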

Crack open a beer! Is anyone reading this? I doubt it. Prove me wrong: if you have read this, message me.

Patrick
Comments:

I read your post.

Very simple Q-learning algorithm; I am happy it is working here. I will try it out. Thanks