
@awjuliani
Last active October 25, 2022 07:57
Q-Table learning in OpenAI grid world.
# Q-Table Learning

```python
import gym
import numpy as np
```

### Load the environment

```python
env = gym.make('FrozenLake-v0')
```

### Implement Q-Table learning algorithm

```python
#Initialize table with all zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
#create lists to contain total rewards and steps per episode
#jList = []
rList = []
for i in range(num_episodes):
    #Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    #The Q-Table learning algorithm
    while j < 99:
        j+=1
        #Choose an action by greedily (with noise) picking from Q table
        a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))
        #Get new state and reward from environment
        s1,r,d,_ = env.step(a)
        #Update Q-Table with new knowledge
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d == True:
            break
    #jList.append(j)
    rList.append(rAll)
```

```python
print "Score over time: " + str(sum(rList)/num_episodes)
```

```python
print "Final Q-Table Values"
print Q
```

Notebook kernel: Python 2 (2.7.11).
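
For reference, the update inside the loop above is the standard tabular Q-learning rule, with learning rate $\alpha$ (`lr` in the code) and discount factor $\gamma$ (`y` in the code):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]$$
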
@obalcells

Hello Juliani, thanks for the nice post on Medium. I know this code is already very old, but I still wanted to ask you a question anyway. When you update the Q-value of the state you took the action in, Q[s,a] = Q[s,a] + lr*( r + y*np.max(Q[s1,:]) - Q[s,a] ), you are in theory multiplying gamma by the expected future reward after you've taken action a; however, in the code you multiply gamma by the biggest value among the next state's Q-values, np.max(Q[s1,:]). Am I misunderstanding something about "plus the maximum discounted (γ) future reward expected according to our own table for the next state (s') we would end up in", or is there a mistake in the code? (I'm probably wrong haha)
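
For what it's worth, the two descriptions refer to the same quantity: in tabular Q-learning, the expected future return after ending up in state s1 is estimated by the value of the best action available there, which is exactly np.max(Q[s1,:]). A minimal sketch of the same update with the target split out, using the gist's own variable names:

```python
# TD target: immediate reward plus the discounted value of the best action
# available in the next state s1; this is the estimate of expected future reward.
td_target = r + y*np.max(Q[s1,:])
# Move Q[s,a] a fraction lr of the way toward that target.
Q[s,a] = Q[s,a] + lr*(td_target - Q[s,a])
```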

@alexandervandekleut

Hey! I was trying to figure out why my implementation of this wasn't working, and I found out that this code only works if you add noise. Even epsilon-greedy approaches fail to get any reward. Removing + np.random.randn(1,env.action_space.n)*(1./(i+1)) results in 0 reward. I understand the importance of visiting as many (s, a) pairs as possible, but it seems strange to me that whether this process works at all depends so heavily on noise.

@tykurtz commented Jan 14, 2019

@alexandervandekleut

It makes sense that the randomness is necessary. If there's no randomness, then a = np.argmax(Q[s,:]) always returns 0 (i.e. move left), since Q is initialized with all zeros in this setup. Because the reward is only given when the goal is reached, and never for intermediate steps, there will never be any feedback to update Q unless the agent reaches the goal at some point. That isn't possible if it never tries to move right.
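
A rough sketch of how an epsilon-greedy variant can be made workable here is to break ties among equal Q-values at random instead of always taking the first index that np.argmax returns (the epsilon value and helper name below are illustrative, not from the gist):

```python
epsilon = 0.1  # illustrative exploration rate

def choose_action(Q, s, epsilon):
    # With probability epsilon, take a random exploratory action.
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    # Otherwise act greedily, breaking ties among equal Q-values at random;
    # plain np.argmax would always pick action 0 from an all-zero row.
    best = np.flatnonzero(Q[s,:] == Q[s,:].max())
    return np.random.choice(best)
```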

@axb2035 commented May 7, 2019

Thanks for the code. Question: FrozenLake returns r=1 when the agent reaches the last square and r=0 in all other cases. If rAll is meant to be the running sum of total reward, shouldn't it be initialized before the loop over episodes rather than inside it? Otherwise it is redundant, since you could just use rList.append(r)...

@j-w-yun commented May 26, 2019

> Thanks for the code. Question: FrozenLake returns r=1 when the agent reaches the last square and r=0 in all other cases. If rAll is meant to be the running sum of total reward, shouldn't it be initialized before the loop over episodes rather than inside it? Otherwise it is redundant, since you could just use rList.append(r)...

rAll is used to calculate the average reward per episode at the end of training, which I don't think is that helpful. It makes more sense to instead calculate moving averages to track how the success rate changes over the training episodes. That information is especially useful when coupled with the historical values of j (the number of steps in each episode) to visualize how the number of steps changes during training; I don't know why that part was commented out in the code. Anyway, to calculate a moving average of the model's success rate, you either need to keep track of j or add up the rewards in each episode (although the reward is 0 or 1 in this case, other problems may be more complex, such as chess, where the episode reward may vary according to pieces captured, draw/loss, etc.) and be able to refer to a given episode's total by its index in rList.
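
A rough sketch of that idea, assuming rList holds the 0/1 per-episode rewards collected above (the window size is an arbitrary choice):

```python
window = 100  # arbitrary moving-average window

# Success rate over a sliding window of episodes; the same could be done
# for jList to see how episode length changes during training.
success_rate = np.convolve(rList, np.ones(window)/window, mode='valid')
print("Success rate over the last %d episodes: %.3f" % (window, success_rate[-1]))
```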
